# Test Infrastructure Hardening — Implementation Plan

> **For Tier 3 workers:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
>
> **Tier 2 supervision required:** Phase 1, Phase 3 (the conftest refactor), and Phase 4 (the `_sync_rag_engine` race fix) MUST be supervised by a Tier 2 Tech Lead. These touch the session-scoped `live_gui` fixture and the controller's hot path; the prior attempt at the conftest refactor was reverted due to corruption (see `docs/reports/rag_test_batch_failure_status_20260609_pm3.md`).

**Goal:** Fix the 3 root causes of test regression churn (subprocess state pollution, filesystem path hygiene, io_pool race) + 2 related bugs (set_value hook, optional clean-baseline) so the 4 upcoming tracks (qwen_llama_grok, data_oriented_error_handling, data_structure_strengthening, mcp_architecture_refactor) start from a known green baseline.

**Architecture:** Each phase is self-contained, with TDD: failing test first, then minimum implementation, then verify pass, then commit. Per-task atomic commits. No batching.

**Tech Stack:** Python 3.11+, pytest, FastAPI/Uvicorn (live_gui), tmp_path_factory, threading.Lock.

---

## Pre-Phase 0: Tier 2 checkpoint + dirty-state audit

Before starting Phase 1, the Tier 2 Tech Lead must:

- [ ] **Step 0.1: Read all referenced reports**
  - `docs/reports/rag_test_batch_failure_status_20260609_pm3.md` (filesystem hygiene findings)
  - `docs/reports/rag_work_final_20260609_pm.md` (io_pool race, set_value hook)
  - `docs/reports/test_infra_hardening_foundation_20260608.md` (foundation, 5 phases)
  - `docs/reports/batch_resilience_plan_20260608.md` (4 batch-resilience solutions)
  - `conductor/edit_workflow.md` (surgical tool guidance)
- [ ] **Step 0.2: Verify the dirty working tree is safe**
  - Working tree currently has uncommitted changes in `config.toml`, `manualslop_layout.ini`, `project_history.toml`, `src/warmup.py`.
    These are user workspace artifacts, NOT test infrastructure.
  - **Do NOT commit these.** They are out of scope.
  - Use `git stash --keep-index` or commit them separately if the user requests.
- [ ] **Step 0.3: Run the current batch baseline to capture "before" state**

```powershell
cd C:\projects\manual_slop; uv run .\scripts\run_tests_batched.py 2>&1 | Tee-Object -FilePath "tests\artifacts\batch_baseline_20260609.log" | Select-Object -Last 50
```

Expected: tier-1 + tier-2 pass; tier-3 has the documented failures (RAG dim-mismatch, set_value hook, RAG phase4 final verify, RAG phase4 stress).

---

## Phase 1: Audit (no code changes)

Focus: Catalog the existing state so Phases 2-7 have a data-grounded baseline.

### Task 1.1: Enumerate `live_gui` test cross-file state dependencies

**Files:**
- Read: `tests/conftest.py:282-547` (the `live_gui` fixture)
- Read: all 49+ test files that use `live_gui`

- [ ] **Step 1.1.1: Generate the live_gui test inventory**

```powershell
cd C:\projects\manual_slop; uv run python -c "
from pathlib import Path
root = Path('tests')
files = sorted(root.glob('test_*.py'))
users = []
for f in files:
 text = f.read_text(encoding='utf-8')
 if 'live_gui' in text:
  users.append(f.name)
print(f'{len(users)} test files use live_gui:')
for u in users:
 print(f' {u}')
"
```

Save output to `conductor/tracks/test_infrastructure_hardening_20260609/audit/live_gui_users.txt`.

- [ ] **Step 1.1.2: For each live_gui test file, grep for `set_value` calls and `get_value` calls**

```powershell
cd C:\projects\manual_slop; Get-Content conductor/tracks/test_infrastructure_hardening_20260609/audit/live_gui_users.txt | Where-Object { $_ -match '^\s*test_.*\.py\s*$' } | ForEach-Object { $f = $_.Trim(); "=== $f ==="; Select-String -Path "tests\$f" -Pattern '(set_value|get_value|reset_session)' | Select-Object LineNumber, Line | Format-Table -AutoSize | Out-String } | Tee-Object -FilePath "conductor/tracks/test_infrastructure_hardening_20260609/audit/live_gui_state_io.txt"
```

The loop uses `ForEach-Object` because the output of a `foreach` statement cannot be piped to `Tee-Object`. The `Where-Object` filter skips the inventory's header line, and `Trim()` removes the leading space the inventory script prints before each file name.

Save output to the audit directory.
This shows which tests read state set by other tests.

- [ ] **Step 1.1.3: Categorize each test as "self-contained" or "cross-test-dependent"**

Self-contained = no `set_value` calls OR all `set_value` calls are within the same test function.
Cross-test-dependent = has `get_value` calls that depend on a prior test's `set_value`.

Save the categorization to `conductor/tracks/test_infrastructure_hardening_20260609/audit/live_gui_dependencies.json`:

```json
{
  "self_contained": ["test_a.py", "test_b.py", ...],
  "cross_test_dependent": ["test_x.py::test_y", ...]
}
```

- [ ] **Step 1.1.4: Commit the audit**

```powershell
cd C:\projects\manual_slop; git add conductor/tracks/test_infrastructure_hardening_20260609/audit/
git commit -m "conductor(audit): catalog live_gui test cross-file state dependencies"
```

### Task 1.2: Document the current `live_gui_workspace` path-hygiene state

- [ ] **Step 1.2.1: Find all hardcoded references to `tests/artifacts/live_gui_workspace`**

```powershell
cd C:\projects\manual_slop; rg -n "tests/artifacts/live_gui_workspace" tests/ --type py | Tee-Object -FilePath "conductor/tracks/test_infrastructure_hardening_20260609/audit/hardcoded_paths.txt"
```

Expect 7+ matches per the spec's "Files affected" list.

- [ ] **Step 1.2.2: Find all `Path("C:/projects/")` or `Path("C:\\\\projects\\\\")` references in test files**

```powershell
cd C:\projects\manual_slop; rg -n 'Path\("C:[/\\]+projects' tests/ --type py | Tee-Object -FilePath "conductor/tracks/test_infrastructure_hardening_20260609/audit/hardcoded_project_root.txt"
```

Any count is acceptable here, including 0. The spec says there are none in production code; this step checks whether the tests contain any. Record the count in the audit.
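Where `rg` is not available, the audit in Steps 1.2.1 and 1.2.2 can be approximated in Python. The sketch below is only an illustration: `find_hardcoded_paths` is a hypothetical helper name, it does plain substring matching rather than regex matching, and the demo runs against a throwaway directory rather than the real `tests/` tree.

```python
from pathlib import Path
import tempfile

def find_hardcoded_paths(root: Path, needle: str) -> list[tuple[str, int, str]]:
 """Return (relative file, line number, stripped line) for every line containing needle."""
 hits = []
 for path in sorted(root.rglob("*.py")):
  text = path.read_text(encoding="utf-8", errors="replace")
  for lineno, line in enumerate(text.splitlines(), start=1):
   if needle in line:
    hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
 return hits

# Demo on a temporary tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
 demo_root = Path(tmp)
 (demo_root / "test_a.py").write_text(
  'ws = Path("tests/artifacts/live_gui_workspace")\n', encoding="utf-8"
 )
 (demo_root / "test_b.py").write_text("x = 1\n", encoding="utf-8")
 demo_hits = find_hardcoded_paths(demo_root, "tests/artifacts/live_gui_workspace")
```

Against the real repository, the call would be `find_hardcoded_paths(Path("tests"), "tests/artifacts/live_gui_workspace")`, with the hits written to the same audit file the `rg` command targets.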
- [ ] **Step 1.2.3: Commit the audit**

```powershell
cd C:\projects\manual_slop; git add conductor/tracks/test_infrastructure_hardening_20260609/audit/hardcoded_*.txt
git commit -m "conductor(audit): document hardcoded workspace paths in test suite"
```

### Task 1.3: Document the current `_sync_rag_engine` race

- [ ] **Step 1.3.1: Read `src/app_controller.py:_sync_rag_engine` and its callers**

Use `manual-slop_py_get_definition` to read `_sync_rag_engine`. Identify:
- The set of setters that trigger sync (e.g., `rag_collection_name`, `files`, `rag_enabled`, `rag_source`, `rag_emb_provider`).
- The submit-to-io_pool call site.
- Whether there's any existing coalescing/debouncing.

- [ ] **Step 1.3.2: Write the audit to `conductor/tracks/test_infrastructure_hardening_20260609/audit/sync_rag_race.md`**

Format:

```markdown
# _sync_rag_engine Race Audit

## Setters that trigger sync
- `set_rag_collection_name` (src/app_controller.py:N)
- `set_rag_enabled` (src/app_controller.py:N)
- `set_files` (src/app_controller.py:N)
- ...

## Submit pattern
[paste 5-10 lines of the submit call]

## Coalescing mechanism
[None / Token-based / Lock-based / etc.]

## Race scenario
1. Test fires setter A → submit task T1
2. Test fires setter B (50ms later) → submit task T2
3. T1 starts on io_pool thread, starts constructing RAGEngine
4. T2 starts on a different io_pool thread, starts constructing RAGEngine
5. T1 finishes first, sets self.rag_engine = engine_A
6. T2 finishes, sets self.rag_engine = engine_B
7. Test queries self.rag_engine → engine_B (last writer wins)
8. engine_B may not have indexed the file from setter A → test fails
```

- [ ] **Step 1.3.3: Commit the audit**

```powershell
cd C:\projects\manual_slop; git add conductor/tracks/test_infrastructure_hardening_20260609/audit/sync_rag_race.md
git commit -m "conductor(audit): document _sync_rag_engine race in controller"
```

### Task 1.4: Document the `set_value` hook for `ai_input`

- [ ] **Step 1.4.1: Read `src/api_hooks.py` `/api/gui/set_value` endpoint**

Use `manual-slop_py_get_definition` to find the endpoint. Identify the parameter-to-handler mapping.

- [ ] **Step 1.4.2: Read `src/gui_2.py:__setattr__` and the `_UI_FLAG_DEFAULTS` allowlist**

Use `manual-slop_py_get_definition` to read both. Verify the allowlist is in place (from commit `bcdc26d0`).

- [ ] **Step 1.4.3: Test the failing case directly via the live_gui fixture**

Write a diagnostic test (NOT yet committed) that:
1. Gets the live_gui fixture.
2. Calls `client.set_value('ai_input', 'hello')`.
3. Waits 0.5s.
4. Calls `client.get_value('ai_input')`.
5. Prints the result.

Run with: `cd C:\projects\manual_slop; uv run pytest -s -xvs --no-header tests/test_gui2_set_value_hook_works.py 2>&1 | Select-Object -Last 30`

If the test fails, read the API hooks endpoint to find the missing branch.

- [ ] **Step 1.4.4: Write the audit to `conductor/tracks/test_infrastructure_hardening_20260609/audit/set_value_hook.md`**

Format:

```markdown
# set_value('ai_input') Audit

## Endpoint code path
[paste the relevant 10-20 lines from /api/gui/set_value]

## Expected flow
1. POST /api/gui/set_value with {"field": "ai_input", "value": "hello"}
2. Endpoint calls controller.set_ai_input("hello") (or similar)
3. Controller sets self.ai_input = "hello"
4. Subsequent get_value('ai_input') returns "hello"

## Actual flow (from diagnostic)
1. POST returns 'queued'
2. Controller does NOT set self.ai_input
3. Subsequent get_value returns ''

## Root cause
[Identify the missing branch — likely the /api/gui/set_value endpoint has a hardcoded list of fields it handles, and 'ai_input' is not on the list, OR the controller's __setattr__ drops the assignment]

## Fix location
[src/api_hooks.py:N or src/gui_2.py:N]
```

- [ ] **Step 1.4.5: Commit the audit**

```powershell
cd C:\projects\manual_slop; git add conductor/tracks/test_infrastructure_hardening_20260609/audit/set_value_hook.md
git commit -m "conductor(audit): trace set_value('ai_input') flow to find routing bug"
```

### Phase 1 verification

- [ ] **Step 1.V.1: All 7 audit files committed**
  - `audit/live_gui_users.txt`
  - `audit/live_gui_state_io.txt`
  - `audit/live_gui_dependencies.json`
  - `audit/hardcoded_paths.txt`
  - `audit/hardcoded_project_root.txt`
  - `audit/sync_rag_race.md`
  - `audit/set_value_hook.md`
- [ ] **Step 1.V.2: User reviews the audit**
  - Tier 2 Tech Lead presents the audit to the user.
  - User approves before Phase 2 begins.

---

## Phase 2: FR1 — Per-test subprocess health check + respawn

Focus: Add an autouse fixture that recovers the `live_gui` subprocess before each test, when degraded.

### Task 2.1: Add a `_LiveGuiHandle` class with `ensure_alive()`

**Files:**
- Modify: `tests/conftest.py` (add `_LiveGuiHandle` class BEFORE the `live_gui` fixture)

- [ ] **Step 2.1.1: Pre-edit checkpoint (Tier 2 supervised)**

```powershell
cd C:\projects\manual_slop; git stash push -m "wip before Phase 2"
```

(The current working tree has user workspace artifacts that should NOT be in this commit; stash them and re-apply after Phase 2's commit.)

- [ ] **Step 2.1.2: Read the existing `live_gui` fixture**

Read `tests/conftest.py:282-547` with `manual-slop_get_file_slice`. Note:
- The current `live_gui` fixture creates a subprocess at line 412-450.
- The fixture's `finally` block (line 516-547) calls `kill_process_tree`.
- The fixture yields `(process, gui_script)` (a tuple).
- [ ] **Step 2.1.3: Refactor `live_gui` to use a `_LiveGuiHandle` class**

Insert a new class BEFORE the `live_gui` fixture (around line 280):

```python
class _LiveGuiHandle:
 def __init__(self, gui_script: str, workspace: Path, log_path: Path) -> None:
  self._gui_script = gui_script
  self._workspace = workspace
  self._log_path = log_path
  self._process: subprocess.Popen | None = None
  self._lock = threading.Lock()
  self._respawn_count = 0
  self._spawn()

 def _spawn(self) -> None:
  # Existing fixture spawn logic, lifted from conftest.py:412-450
  # (use the actual spawn logic from the current fixture)
  ...

 def is_alive(self) -> bool:
  return self._process is not None and self._process.poll() is None

 def ensure_alive(self) -> None:
  with self._lock:
   if not self.is_alive():
    self._respawn_count += 1
    self._spawn()

 @property
 def process(self) -> subprocess.Popen:
  self.ensure_alive()
  assert self._process is not None
  return self._process

 @property
 def respawn_count(self) -> int:
  return self._respawn_count
```

**CRITICAL — 1-space indent.** Use the exact pattern from `tests/conftest.py`. Do not introduce 4-space indent.

Use `manual-slop_py_add_def` to insert the class at `top` of the file. Verify the indent via `ast.parse` after insertion.

- [ ] **Step 2.1.4: Refactor the `live_gui` fixture to use the handle**

Change the fixture from yielding a tuple `(process, gui_script)` to yielding a `_LiveGuiHandle` instance. Update the docstring.

Use `manual-slop_set_file_slice` to replace the fixture's body. Verify the indent.

- [ ] **Step 2.1.5: Update all 49 live_gui tests to use the new API**

The current pattern is:

```python
def test_x(live_gui):
 process, gui_script = live_gui
```

The new pattern is:

```python
def test_x(live_gui):
 handle = live_gui
 process = handle.process
```

This is a sweep across all 49 test files. Use `rg` to find all `process, gui_script = live_gui` lines, then sed/Python to replace.
```powershell
cd C:\projects\manual_slop; uv run python -c "
from pathlib import Path
count = 0
for f in Path('tests').glob('test_*.py'):
 lines = f.read_text(encoding='utf-8').splitlines(keepends=True)
 out = []
 changed = False
 for line in lines:
  if line.strip() == 'process, gui_script = live_gui':
   # Keep the original indentation and emit both lines of the new pattern
   indent = line[:len(line) - len(line.lstrip())]
   out.append(indent + 'handle = live_gui\n')
   out.append(indent + 'process = handle.process\n')
   changed = True
  else:
   out.append(line)
 if changed:
  f.write_text(''.join(out), encoding='utf-8')
  count += 1
print(f'Updated {count} test files')
"
```

This is a single in-place edit; verify with `git diff --stat`. The sweep no longer binds `gui_script`, so any test that still uses that name will raise `NameError`; find these with `rg -n gui_script tests/` and fix them individually.

- [ ] **Step 2.1.6: Run a representative live_gui test to verify the refactor**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_gui_startup_smoke.py -v --timeout=30
```

Expected: PASS. If FAIL, revert via `git checkout tests/` and re-investigate.

- [ ] **Step 2.1.7: Commit the refactor (Tier 2 supervised)**

```powershell
cd C:\projects\manual_slop; git add tests/
git commit -m "refactor(test): wrap live_gui subprocess in _LiveGuiHandle class"
$h = git log -1 --format='%H'
git notes add -m "Refactor the session-scoped live_gui fixture to yield a _LiveGuiHandle instead of a (process, gui_script) tuple. The handle has ensure_alive() and respawn_count. All 49 dependent test files updated to consume the handle. Foundation for the autouse _check_live_gui_health fixture in Task 2.2." $h
```

### Task 2.2: Add the autouse `_check_live_gui_health` fixture

**Files:**
- Modify: `tests/conftest.py` (add the autouse fixture AFTER the `live_gui` fixture)

- [ ] **Step 2.2.1: Write a failing test (TDD red)**

Create `tests/test_live_gui_respawn.py`:

```python
import time

def test_live_gui_respawn_after_kill(live_gui):
 handle = live_gui
 # Hold one reference: the `process` property calls ensure_alive() on every
 # access, so re-reading it after kill() could respawn before the assertion.
 proc = handle.process
 initial_pid = proc.pid
 initial_respawn_count = handle.respawn_count
 proc.kill()
 proc.wait(timeout=5)
 assert not handle.is_alive()
 handle.ensure_alive()
 assert handle.is_alive()
 new_pid = handle.process.pid
 assert new_pid != initial_pid
 assert handle.respawn_count == initial_respawn_count + 1

def test_live_gui_no_respawn_on_clean(live_gui):
 handle = live_gui
 initial_count = handle.respawn_count
 handle.ensure_alive()
 assert handle.respawn_count == initial_count

def test_live_gui_health_check_fast_path(live_gui):
 handle = live_gui
 t0 = time.perf_counter()
 handle.ensure_alive()
 elapsed = time.perf_counter() - t0
 assert elapsed < 0.1, f"ensure_alive took {elapsed:.3f}s on a clean subprocess"
```

- [ ] **Step 2.2.2: Run the test to confirm it FAILS**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_live_gui_respawn.py -v --timeout=30
```

Expected: FAIL if the `_LiveGuiHandle` from Task 2.1 is not in place (the `respawn_count` attribute doesn't exist yet). If Task 2.1 has already landed, these tests call the handle directly and may pass before the autouse fixture exists.

- [ ] **Step 2.2.3: Add the autouse fixture to `tests/conftest.py`**

Insert AFTER the `live_gui` fixture:

```python
@pytest.fixture(autouse=True)
def _check_live_gui_health(request):
 # Do not take live_gui as a parameter: an autouse fixture that requests it
 # would start the GUI subprocess for every test, including unit tests.
 if "live_gui" in request.fixturenames:
  handle = request.getfixturevalue("live_gui")
  handle.ensure_alive()
 yield
```

Use `manual-slop_py_add_def` with anchor_type `after` and anchor_symbol `live_gui`.

- [ ] **Step 2.2.4: Run the test to confirm it PASSES**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_live_gui_respawn.py -v --timeout=30
```

Expected: 3 tests PASS.
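The kill-then-respawn behaviour these tests rely on can be tried outside the test suite. The sketch below is a stand-in only: `DemoHandle` is a hypothetical class, and its child process is a sleeping Python interpreter rather than the real GUI script, so it shows the `poll()`-based liveness check and the respawn counter and nothing about the actual spawn logic in `conftest.py`.

```python
import subprocess
import sys
import threading

class DemoHandle:
 """Hypothetical stand-in for _LiveGuiHandle; spawns a sleeping child, not the GUI."""
 def __init__(self) -> None:
  self._process: subprocess.Popen | None = None
  self._lock = threading.Lock()
  self.respawn_count = 0
  self._spawn()

 def _spawn(self) -> None:
  self._process = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

 def is_alive(self) -> bool:
  # poll() returns None while the child is still running
  return self._process is not None and self._process.poll() is None

 def ensure_alive(self) -> None:
  with self._lock:
   if not self.is_alive():
    self.respawn_count += 1
    self._spawn()

 def stop(self) -> None:
  if self._process is not None and self._process.poll() is None:
   self._process.kill()
   self._process.wait(timeout=5)

handle = DemoHandle()
handle.ensure_alive()                      # child is alive: no respawn
count_after_clean_check = handle.respawn_count
handle._process.kill()
handle._process.wait(timeout=5)            # reap so poll() reports the exit
dead_after_kill = not handle.is_alive()
handle.ensure_alive()                      # child is dead: respawn once
alive_after_respawn = handle.is_alive()
handle.stop()                              # clean up the demo child
```

Note that the demo reaches into `_process` directly instead of going through a self-healing property, for the same reason the test above holds a local `proc` reference.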
- [ ] **Step 2.2.5: Run the full tier-3 live_gui batch to verify no regression**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/ -k "live_gui" -v --timeout=30 -x 2>&1 | Select-Object -Last 50
```

Expected: Most tests pass; the documented failures (RAG dim-mismatch, set_value, RAG phase4) still fail. NO new failures.

- [ ] **Step 2.2.6: Commit (Tier 2 supervised)**

```powershell
cd C:\projects\manual_slop; git add tests/conftest.py tests/test_live_gui_respawn.py
git commit -m "feat(test): autouse _check_live_gui_health recovers from degraded subprocess"
$h = git log -1 --format='%H'
git notes add -m "Adds an autouse fixture that calls handle.ensure_alive() before each test that uses live_gui. If the subprocess is dead, it respawns. If alive, the check is <100ms. Three new tests in tests/test_live_gui_respawn.py verify the respawn, the no-op-on-clean path, and the performance budget." $h
```

### Phase 2 verification

- [ ] **Step 2.V.1: 3 new tests in `tests/test_live_gui_respawn.py` pass**
- [ ] **Step 2.V.2: No new regressions in tier-3 batch**
- [ ] **Step 2.V.3: User reviews the autouse respawn behavior**
  - Per-test respawn adds <200ms per test. Verify with the 49 tests in batch.
  - User approves before Phase 3 begins.

---

## Phase 3: FR2 — `live_gui_workspace` fixture + update 6 test files

Focus: Eliminate hardcoded `Path("tests/artifacts/live_gui_workspace")` from test files. Use `tmp_path_factory.mktemp`.

**Tier 2 supervised for entire phase** (the prior attempt at this refactor was reverted due to corruption; see `docs/reports/rag_test_batch_failure_status_20260609_pm3.md`).

### Task 3.1: Refactor `live_gui` to use `tmp_path_factory.mktemp`

**Files:**
- Modify: `tests/conftest.py` (the `live_gui` fixture's workspace creation)

- [ ] **Step 3.1.1: Pre-edit checkpoint**

```powershell
cd C:\projects\manual_slop; git add .
git commit -m "wip: pre-Phase 3 checkpoint" --allow-empty
```

Confirm the user workspace artifacts from Step 0.2 are stashed before running this; `git add .` would otherwise stage them.

- [ ] **Step 3.1.2: Use `manual-slop_set_file_slice` to replace the workspace creation**

Read `tests/conftest.py:410-414` with `manual-slop_get_file_slice`. Note the EXACT text of the lines that create the workspace (the `Path("tests/artifacts/live_gui_workspace")` reference and the surrounding `os.makedirs` or `mkdir` calls).

Replace ONLY those lines with:

```python
workspace = tmp_path_factory.mktemp("live_gui_workspace")
```

where `tmp_path_factory` is added to the fixture's parameters. The fixture signature changes from:

```python
def live_gui(request):
```

to:

```python
def live_gui(request, tmp_path_factory):
```

**CRITICAL — verify via `ast.parse` after the edit.** Use `manual-slop_py_check_syntax tests/conftest.py` to confirm syntax is valid.

- [ ] **Step 3.1.3: Verify the fixture still spawns the subprocess correctly**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_gui_startup_smoke.py -v --timeout=30
```

Expected: PASS. If FAIL, the workspace path is being constructed wrong.

- [ ] **Step 3.1.4: Verify the new workspace is a tmp dir (not under project tree)**

Add a debug print to the test:

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_gui_startup_smoke.py -v --timeout=30 -s 2>&1 | Select-String "workspace"
```

Expect: workspace is under `C:\Users\...\AppData\Local\Temp\...`, NOT `C:\projects\manual_slop\tests\artifacts\...`.

- [ ] **Step 3.1.5: Commit (Tier 2 supervised)**

```powershell
cd C:\projects\manual_slop; git add tests/conftest.py
git commit -m "refactor(test): live_gui workspace via tmp_path_factory"
$h = git log -1 --format='%H'
git notes add -m "Replaces the hardcoded Path('tests/artifacts/live_gui_workspace') with tmp_path_factory.mktemp('live_gui_workspace'). The workspace now lives in pytest's tmp dir, not in the project tree. Foundation for exposing the workspace path as a separate fixture in Task 3.2." $h
```

### Task 3.2: Expose `live_gui_workspace` as a separate fixture

**Files:**
- Modify: `tests/conftest.py` (add a new fixture)

- [ ] **Step 3.2.1: Write a failing test (TDD red)**

Create `tests/test_live_gui_workspace_fixture.py`:

```python
from pathlib import Path

def test_live_gui_workspace_is_absolute(live_gui_workspace):
 assert live_gui_workspace.is_absolute()

def test_live_gui_workspace_unique_per_session(live_gui, live_gui_workspace):
 # The fixture starts empty, so only existence and type can be asserted here.
 assert live_gui_workspace.exists()
 assert live_gui_workspace.is_dir()

def test_live_gui_workspace_passed_to_test(live_gui_workspace):
 test_file = live_gui_workspace / "test_file.txt"
 test_file.write_text("hello")
 assert test_file.read_text() == "hello"
```

- [ ] **Step 3.2.2: Run the test to confirm it FAILS**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_live_gui_workspace_fixture.py -v --timeout=30
```

Expected: FAIL (no `live_gui_workspace` fixture yet).

- [ ] **Step 3.2.3: Add the `live_gui_workspace` fixture to `tests/conftest.py`**

Insert AFTER the `live_gui` fixture:

```python
@pytest.fixture
def live_gui_workspace(live_gui) -> Path:
 handle = live_gui
 return handle._workspace # type: ignore[attr-defined]
```

The handle has the workspace as `_workspace` (set in Task 2.1.3).

Use `manual-slop_py_add_def` with `anchor_type=after, anchor_symbol=live_gui`.

- [ ] **Step 3.2.4: Run the test to confirm it PASSES**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_live_gui_workspace_fixture.py -v --timeout=30
```

Expected: 3 tests PASS.

- [ ] **Step 3.2.5: Commit**

```powershell
cd C:\projects\manual_slop; git add tests/conftest.py tests/test_live_gui_workspace_fixture.py
git commit -m "feat(test): expose live_gui_workspace as a separate fixture"
$h = git log -1 --format='%H'
git notes add -m "Adds the live_gui_workspace fixture that returns the absolute path to the live_gui subprocess's workspace. Tests that need to create files in the workspace should request this fixture instead of hardcoding Path('tests/artifacts/live_gui_workspace')." $h
```

### Task 3.3: Update the 6 dependent test files

**Files:**
- Modify: 6 test files that hardcode the workspace path

- [ ] **Step 3.3.1: Read each test file and identify the hardcoded reference**

For each of:
- `tests/test_rag_phase4_final_verify.py:20`
- `tests/test_rag_phase4_stress.py:21`
- `tests/test_saved_presets_sim.py:14, 121`
- `tests/test_tool_presets_sim.py:13`
- `tests/test_visual_sim_gui_ux.py:79`

Read the surrounding 5 lines with `manual-slop_get_file_slice` to understand the context. The pattern is:

```python
workspace = Path("tests/artifacts/live_gui_workspace")
workspace.mkdir(parents=True, exist_ok=True)
# ... use workspace
```

- [ ] **Step 3.3.2: For each test file, refactor to use the fixture**

For each file, do this surgical edit:
1. Add `live_gui_workspace` to the test function's parameter list.
2. Replace `Path("tests/artifacts/live_gui_workspace")` with `live_gui_workspace`.
3. Remove the `mkdir` call (the fixture creates the dir).
4. Use `live_gui_workspace.mkdir(parents=True, exist_ok=True)` ONLY if subsequent code needs the dir to exist before the fixture's init (rare).

Use `manual-slop_edit_file` for each replacement. **One file at a time. Verify after each.**

- [ ] **Step 3.3.3: Run each updated test file to verify the refactor**

For each updated file, run:

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_rag_phase4_final_verify.py -v --timeout=60
uv run pytest tests/test_rag_phase4_stress.py -v --timeout=60
uv run pytest tests/test_saved_presets_sim.py -v --timeout=60
uv run pytest tests/test_tool_presets_sim.py -v --timeout=60
uv run pytest tests/test_visual_sim_gui_ux.py -v --timeout=60
```

Expected: Each file passes in isolation. If any fails, the refactor broke something — investigate.
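The reason a relative workspace path is fragile can be shown in a few lines: the same relative `Path` resolves to different locations depending on the current working directory, while an absolute path does not. The sketch below is illustrative only; `resolve_from` is a hypothetical helper, and the two temporary directories stand in for whatever directories earlier tests may have changed into.

```python
import os
import tempfile
from pathlib import Path

RELATIVE = Path("tests/artifacts/live_gui_workspace")

def resolve_from(cwd: Path, target: Path) -> Path:
 """Resolve target the way a process whose working directory is cwd would."""
 previous = Path.cwd()
 os.chdir(cwd)
 try:
  return target.resolve()
 finally:
  os.chdir(previous)  # always restore the caller's working directory

with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
 dir_a, dir_b = Path(a).resolve(), Path(b).resolve()
 # A relative path points somewhere different from each working directory.
 relative_differs = resolve_from(dir_a, RELATIVE) != resolve_from(dir_b, RELATIVE)
 # An absolute path (standing in for tmp_path_factory.mktemp) is unaffected.
 absolute = dir_a / "live_gui_workspace"
 absolute_stable = resolve_from(dir_a, absolute) == resolve_from(dir_b, absolute)
```

This is the property the fixture-based workspace depends on: whatever a previous test did to the working directory, the path handed to the next test still names the same directory.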
- [ ] **Step 3.3.4: Run the same files in batch to verify the BATCH failure is fixed**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_extended_sims.py tests/test_rag_phase4_final_verify.py -v --timeout=120
```

Expected: The RAG test PASSES after the 4 sims. **This is the primary symptom the user wanted fixed.**

- [ ] **Step 3.3.5: Commit (Tier 2 supervised)**

```powershell
cd C:\projects\manual_slop; git add tests/
git commit -m "refactor(test): 6 test files use live_gui_workspace fixture instead of hardcoded path"
$h = git log -1 --format='%H'
git notes add -m "The 6 test files that hardcoded Path('tests/artifacts/live_gui_workspace') now request the live_gui_workspace fixture, which yields the absolute path. The RAG test passes in batch (after 4 sims) for the first time, because the workspace path is now absolute and CWD-independent." $h
```

### Phase 3 verification

- [ ] **Step 3.V.1: 6 test files updated and pass in isolation**
- [ ] **Step 3.V.2: RAG test passes in batch (after 4 sims)** — the primary goal
- [ ] **Step 3.V.3: `tests/test_live_gui_workspace_fixture.py` 3 tests pass**
- [ ] **Step 3.V.4: User reviews the RAG test passing in batch**
  - This is the "kill the nightmare" moment. User confirms before Phase 4.

---

## Phase 4: FR3 — Coalesce `_sync_rag_engine` calls

Focus: Eliminate the io_pool race in `app_controller._sync_rag_engine` so multiple setters in quick succession produce one sync, not N parallel syncs.

**Tier 2 supervised for entire phase.** This touches the controller's hot path.

### Task 4.1: Add a token-based coalescing mechanism

**Files:**
- Modify: `src/app_controller.py` (the `_sync_rag_engine` method and the setters that trigger it)

- [ ] **Step 4.1.1: Read the existing `_sync_rag_engine` and the setters**

Use `manual-slop_py_get_definition` on `AppController._sync_rag_engine`. Identify:
- The exact submit-to-io_pool call.
- The setters that call `_sync_rag_engine` (search for `_sync_rag_engine` usages).
- [ ] **Step 4.1.2: Add the coalescing state to `AppController.__init__`**

Add to `AppController.__init__` (use `manual-slop_py_set_var_declaration`):

```python
self._rag_sync_token: int = 0
self._rag_sync_dirty: bool = False
self._rag_sync_lock: threading.Lock = threading.Lock()
```

- [ ] **Step 4.1.3: Write a failing test (TDD red)**

Create `tests/test_sync_rag_engine_coalescing.py`:

```python
from unittest.mock import patch, MagicMock
from src.app_controller import AppController

def test_sync_rag_engine_coalesces_five_setters():
 # Construct a minimal AppController (use existing fixture if available)
 # Patch the io_pool to count sync submissions
 with patch("src.app_controller.AppController._io_pool") as mock_pool:
  ctrl = AppController(...)
  for i in range(5):
   ctrl.set_rag_collection_name(f"name_{i}")
  # Assert: mock_pool.submit was called 0 times (or 1 time, with 5 setters coalesced)
  ...

def test_sync_rag_engine_rerun_on_token_change():
 ...

def test_sync_rag_engine_idempotent_no_changes():
 ...
```

Note: This test may require the existing test fixture for `AppController`. If no such fixture exists, use a minimal one (construct the controller with a tmp_path, mock the heavy dependencies).

- [ ] **Step 4.1.4: Run the test to confirm it FAILS**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_sync_rag_engine_coalescing.py -v --timeout=30
```

Expected: FAIL (no coalescing yet).
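The behaviour the first test expects (five setter calls, at most one submission) can be prototyped in isolation before touching the controller. The sketch below is one possible shape under stated assumptions, not the controller's code: `CoalescingSync`, `request`, and `slow_build` are hypothetical names, and the `_running` flag is an addition that the state declared in Step 4.1.2 does not include. While a worker is running, later requests only mark the state dirty; the worker loops until no request arrived during its last build.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class CoalescingSync:
 def __init__(self, build, pool: ThreadPoolExecutor) -> None:
  self._build = build
  self._pool = pool
  self._lock = threading.Lock()
  self._token = 0
  self._dirty = False
  self._running = False  # assumption: not part of the plan's declared state
  self.result = None
  self.build_count = 0

 def request(self):
  """Called by every setter. Returns a Future only when a worker had to be started."""
  with self._lock:
   self._token += 1
   self._dirty = True
   if self._running:
    return None  # the running worker will notice the dirty flag and loop
   self._running = True
  return self._pool.submit(self._worker)

 def _worker(self) -> None:
  while True:
   with self._lock:
    token = self._token
    self._dirty = False
   value = self._build(token)  # slow part runs without the lock
   with self._lock:
    self.build_count += 1
    self.result = value
    if not self._dirty:
     self._running = False
     return

started = threading.Event()
release = threading.Event()

def slow_build(token: int) -> str:
 started.set()
 release.wait(timeout=5)  # hold the first build until the demo releases it
 return f"engine-{token}"

with ThreadPoolExecutor(max_workers=4) as pool:
 sync = CoalescingSync(slow_build, pool)
 first = sync.request()
 started.wait(timeout=5)                         # first build is in progress
 coalesced = [sync.request() for _ in range(4)]  # four more "setters"
 release.set()
 first.result(timeout=5)
```

In this run the five requests lead to two builds: the one already in flight and a single rerun that covers the four later requests. Because only one worker ever runs, a stale build cannot finish after a newer one.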
- [ ] **Step 4.1.5: Refactor `_sync_rag_engine` to use the token + dirty flag**

Use `manual-slop_py_update_definition` to replace `_sync_rag_engine`:

```python
def _sync_rag_engine(self) -> None:
 with self._rag_sync_lock:
  self._rag_sync_token += 1
  self._rag_sync_dirty = True
  token = self._rag_sync_token
 self._io_pool.submit(self._do_rag_sync, token)

def _do_rag_sync(self, token: int) -> None:
 while True:
  with self._rag_sync_lock:
   if token != self._rag_sync_token:
    return # a newer sync will pick up our changes
   self._rag_sync_dirty = False
  # Build the engine into a local variable; do not assign self.rag_engine yet
  engine = ...
  with self._rag_sync_lock:
   if token == self._rag_sync_token:
    self.rag_engine = engine # publish only if no newer sync was requested
   if not self._rag_sync_dirty:
    return
   token = self._rag_sync_token
   self._rag_sync_dirty = False
```

The exact body of `_do_rag_sync` should be the existing body of `_sync_rag_engine` (renamed), changed so that it builds the engine into a local variable. The engine is assigned to `self.rag_engine` only under the lock and only while the token still matches; otherwise a slow, superseded sync could finish last and overwrite the result of a newer one, which is the "last writer wins" scenario from the Phase 1 audit. The `if not dirty: return` check at the end ensures we only loop when a NEW setter has fired.

**CRITICAL — thread safety.** The lock protects `token`, `dirty`, and the assignment to `self.rag_engine`. The body of the sync runs WITHOUT the lock (to avoid blocking other setters).

- [ ] **Step 4.1.6: Run the test to confirm it PASSES**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_sync_rag_engine_coalescing.py -v --timeout=30
```

Expected: 3 tests PASS.

- [ ] **Step 4.1.7: Run the RAG test in batch to verify the race is fixed**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_extended_sims.py tests/test_rag_phase4_final_verify.py tests/test_rag_phase4_stress.py -v --timeout=120
```

Expected: All 3 pass. The RAG stress test was previously non-deterministic; the coalescing makes it deterministic.

- [ ] **Step 4.1.8: Commit (Tier 2 supervised)**

```powershell
cd C:\projects\manual_slop; git add src/app_controller.py tests/test_sync_rag_engine_coalescing.py
git commit -m "fix(rag): coalesce _sync_rag_engine calls via token + dirty flag"
$h = git log -1 --format='%H'
git notes add -m "Replaces the immediate-submit-to-io_pool pattern with a token + dirty flag. Multiple setters in quick succession produce one sync, not N parallel syncs. The RAG stress test, which was non-deterministic, is now deterministic. The lock is held only for token/dirty access; the sync body runs lock-free to avoid blocking other setters." $h
```

### Phase 4 verification

- [ ] **Step 4.V.1: 3 new tests in `tests/test_sync_rag_engine_coalescing.py` pass**
- [ ] **Step 4.V.2: RAG stress test passes in batch (no longer non-deterministic)**
- [ ] **Step 4.V.3: No regressions in tier-2 mock_app batch**
- [ ] **Step 4.V.4: User reviews the io_pool race fix**
  - User confirms before Phase 5.

---

## Phase 5: FR4 — Fix `set_value` hook for `ai_input`

Focus: Find the missing branch in `/api/gui/set_value` that causes `set_value('ai_input', ...)` to silently drop the assignment.

### Task 5.1: Reproduce the failure with a minimal test

- [ ] **Step 5.1.1: Read the test that's currently failing**

Use `manual-slop_get_file_slice` to read `tests/test_gui2_set_value_hook_works.py:1-50`.

- [ ] **Step 5.1.2: Run the test to confirm the failure**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_gui2_set_value_hook_works.py -v --timeout=30
```

Expected: FAIL. The failure mode is `set_value returns 'queued' but get_value('ai_input') returns ''`.

- [ ] **Step 5.1.3: Trace the flow with diagnostic prints**

The user's previous attempt at this diagnostic was rejected as "diagnostic noise in production." Use a temporary diagnostic file instead:
- Create `tests/artifacts/diag_set_value.py` (gitignored).
- Add prints to trace the flow.
- Run the test with `pytest -s` to see the prints.
- Once the root cause is identified, DELETE `tests/artifacts/diag_set_value.py` and apply the real fix.
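To make the symptom concrete while tracing, it may help to keep a toy model of the first hypothesis from the Phase 1 audit in mind: an endpoint that routes only the fields it knows about but reports the same status either way. Everything below is hypothetical (`handle_set_value`, the `routable_fields` set, and the state dict are illustrations, not code from `src/api_hooks.py`), and it does not model the second hypothesis, in which `__setattr__` drops the assignment.

```python
def handle_set_value(state: dict, routable_fields: set, field: str, value) -> str:
 """Hypothetical endpoint body: applies the write only for fields it can route."""
 if field in routable_fields:
  state[field] = value
 # The status does not depend on whether the write was routed,
 # which is what makes the drop silent from the client's side.
 return "queued"

state = {"ai_input": "", "rag_enabled": False}

# 'ai_input' is missing from the routing set: the call reports 'queued', state is unchanged.
status_before = handle_set_value(state, {"rag_enabled"}, "ai_input", "hello")
value_before = state["ai_input"]

# With 'ai_input' routable, the same call mutates state.
status_after = handle_set_value(state, {"rag_enabled", "ai_input"}, "ai_input", "hello")
value_after = state["ai_input"]
```

If the real endpoint behaves like the first call, the diagnostic prints should show the request arriving and returning `'queued'` with no corresponding controller write.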
### Task 5.2: Apply the fix

**Files (TBD based on Task 5.1 findings):**
- Likely: `src/api_hooks.py` (the `/api/gui/set_value` endpoint)
- Possibly: `src/gui_2.py` (`__setattr__` or `_UI_FLAG_DEFAULTS` allowlist)

- [ ] **Step 5.2.1: Apply the surgical fix**

Use `manual-slop_edit_file` or `manual-slop_py_update_definition` as appropriate.

- [ ] **Step 5.2.2: Run the test to verify the fix**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_gui2_set_value_hook_works.py -v --timeout=30
```

Expected: PASS.

- [ ] **Step 5.2.3: Commit (Tier 2 supervised)**

```powershell
cd C:\projects\manual_slop; git add src/
git commit -m "fix(api_hooks): set_value('ai_input') actually mutates controller state"
$h = git log -1 --format='%H'
git notes add -m "Identifies the missing branch in the /api/gui/set_value endpoint that caused ai_input to silently drop. The fix is consistent with the _UI_FLAG_DEFAULTS allowlist pattern from bcdc26d0." $h
```

### Phase 5 verification

- [ ] **Step 5.V.1: `tests/test_gui2_set_value_hook_works.py` passes in batch**
- [ ] **Step 5.V.2: No regressions in tier-3 batch**

---

## Phase 6: FR5 — Optional `clean_baseline` marker

Focus: Add a marker that tests can opt into for a clean controller state.

### Task 6.1: Add the marker and the autouse fixture

**Files:**
- Modify: `tests/conftest.py`
- Modify: `pyproject.toml` (add the marker to `[tool.pytest.ini_options].markers`)

- [ ] **Step 6.1.1: Add the marker to `pyproject.toml`**

Read `pyproject.toml` and find `[tool.pytest.ini_options]`. Add:

```toml
"clean_baseline: mark a test as requiring a clean controller state at start. The autouse _reset_clean_baseline fixture will call /api/reset_session before the test."
```

to the existing `markers` list.
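For reference, pytest also allows markers to be registered from `conftest.py` through the `pytest_configure` hook and `config.addinivalue_line`. The sketch below shows that alternative only as an illustration; the plan itself registers the marker in `pyproject.toml`, and the marker should be registered in one place, not both. `_RecordingConfig` is a hypothetical stand-in for pytest's config object so the hook can be exercised here without running pytest.

```python
MARKER_LINE = "clean_baseline: mark a test as requiring a clean controller state at start."

def pytest_configure(config) -> None:
 """pytest hook: register the marker programmatically instead of via pyproject.toml."""
 config.addinivalue_line("markers", MARKER_LINE)

class _RecordingConfig:
 """Hypothetical stand-in for pytest's Config, used only to exercise the hook."""
 def __init__(self) -> None:
  self.lines = []

 def addinivalue_line(self, name: str, line: str) -> None:
  self.lines.append((name, line))

recorder = _RecordingConfig()
pytest_configure(recorder)
```

Either registration route keeps `--strict-markers` runs from rejecting `@pytest.mark.clean_baseline` as unknown.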
- [ ] **Step 6.1.2: Write a failing test (TDD red)**

Create `tests/test_clean_baseline_marker.py`. The two tests run in file order: the first pollutes `ai_input`, and the second verifies that the autouse reset cleared it before the test body ran.

```python
import pytest


@pytest.mark.clean_baseline
def test_clean_baseline_ai_input_is_empty(live_gui):
 client = live_gui.api_client
 # The autouse reset runs before the test body, so ai_input starts empty.
 assert client.get_value("ai_input") == ""
 # Pollute the state so the next test can prove the reset runs again.
 client.set_value("ai_input", "polluted value")


@pytest.mark.clean_baseline
def test_clean_baseline_resets_ai_input_at_start(live_gui):
 # The previous test left ai_input as "polluted value"; the reset must clear it.
 value = live_gui.api_client.get_value("ai_input")
 assert value == "", f"Expected empty ai_input at start of clean_baseline test, got {value!r}"
```

- [ ] **Step 6.1.3: Run the test to confirm it FAILS**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_clean_baseline_marker.py -v --timeout=30
```

Expected: FAIL (no `clean_baseline` autouse yet, so the second test still sees the polluted value).

- [ ] **Step 6.1.4: Add the autouse fixture to `tests/conftest.py`**

Insert AFTER the `_check_live_gui_health` fixture. The fixture resolves `live_gui` lazily through `request.getfixturevalue`, so unmarked tests do not instantiate the session-scoped `live_gui` fixture merely because this fixture is autouse:

```python
@pytest.fixture(autouse=True)
def _reset_clean_baseline(request):
 if request.node.get_closest_marker("clean_baseline"):
  handle = request.getfixturevalue("live_gui")
  handle.api_client.reset_session()  # existing endpoint
 yield
```

Use `manual-slop_py_add_def` with `anchor_type=after, anchor_symbol=_check_live_gui_health`.

- [ ] **Step 6.1.5: Run the test to confirm it PASSES**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_clean_baseline_marker.py -v --timeout=30
```

Expected: 2 tests PASS.
- [ ] **Step 6.1.6: Commit**

```powershell
cd C:\projects\manual_slop; git add tests/conftest.py tests/test_clean_baseline_marker.py pyproject.toml
git commit -m "feat(test): clean_baseline marker resets controller state before test"
$h = git log -1 --format='%H'
git notes add -m "Adds an opt-in clean_baseline marker. Tests marked with @pytest.mark.clean_baseline get a fresh controller state via the existing /api/reset_session endpoint before they start. Two new tests verify the marker works." $h
```

### Phase 6 verification

- [ ] **Step 6.V.1: 2 new tests pass**
- [ ] **Step 6.V.2: User reviews the marker API**
  - User confirms before Phase 7.

---

## Phase 7: FR6 — Run full batch + produce test_bed_health report

Focus: Capture the post-track "after" state. Document what's green, what's red, and what's expected to remain red.

### Task 7.1: Run the full batched suite

- [ ] **Step 7.1.1: Run tier-1 (unit tests)**

```powershell
cd C:\projects\manual_slop; uv run .\scripts\run_tests_batched.py 2>&1 | Tee-Object -FilePath "tests\artifacts\post_track_batch_20260609.log" | Select-Object -First 200
```

Expected: all tier-1 batches pass.

- [ ] **Step 7.1.2: Run tier-2 (mock_app tests)**

Same command, but capture the tier-2 portion.

- [ ] **Step 7.1.3: Run tier-3 (live_gui tests)**

Same command, but capture the tier-3 portion. Note: This is the big one; may take 10+ minutes.

- [ ] **Step 7.1.4: Summarize pass/fail**

From the captured log, extract:

- Total tests run.
- Tests passed.
- Tests failed (with file:line and error message).
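The extraction in Step 7.1.4 could be scripted along these lines. This is a sketch under one stated assumption: that `run_tests_batched.py` passes pytest's plain-text output through to the log, so each batch ends with a summary line such as `== 1 failed, 2 passed in 0.12s ==` and failures appear as `FAILED <node id> - <message>` lines. The function name `summarize_log` is hypothetical, and pytest's short summary gives test node IDs rather than `file:line`, so line numbers would still have to come from the tracebacks. The code uses the 1-space indent from the plan's style baseline.

```python
import re
from collections import Counter

# Per-batch summary line, e.g. "===== 1 failed, 2 passed in 0.12s =====".
SUMMARY_RE = re.compile(r"^=+ (?P<body>.*?) in [\d.]+s.*=+$")
# Individual outcome counts inside a summary line.
COUNT_RE = re.compile(r"(\d+) (passed|failed|errors?|skipped|xfailed|xpassed)")
# Short-summary failure line, e.g. "FAILED tests/test_a.py::test_x - AssertionError: boom".
FAILED_RE = re.compile(r"^(?:FAILED|ERROR) (?P<nodeid>\S+)(?: - (?P<message>.*))?$")


def summarize_log(text):
 """Return (totals, failures) aggregated over every batch in the log text."""
 totals = Counter()
 failures = []
 for raw in text.splitlines():
  line = raw.strip()
  summary = SUMMARY_RE.match(line)
  if summary:
   for count, outcome in COUNT_RE.findall(summary.group("body")):
    # pytest prints "1 error" and "2 errors"; fold both into one key.
    key = "error" if outcome.startswith("error") else outcome
    totals[key] += int(count)
   continue
  failed = FAILED_RE.match(line)
  if failed:
   failures.append((failed.group("nodeid"), failed.group("message") or ""))
 return totals, failures
```

The total number of tests run is then `sum(totals.values())`, and `failures` can be pasted into the report's failure list. If the log contains ANSI colour codes or was written in a non-UTF-8 encoding, it would need to be normalised before parsing.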
### Task 7.2: Produce the test_bed_health report

- [ ] **Step 7.2.1: Write `docs/reports/test_bed_health_20260609.md`**

Template (the outer fence uses four backticks so the nested `powershell` block does not close it early):

````markdown
# Test Bed Health Report (2026-06-09)

**Track:** test_infrastructure_hardening_20260609
**Date:** 2026-06-09
**Status:** [GREEN / YELLOW / RED]

## Summary

| Tier | Tests | Pass | Fail | New Failures | Resolved |
|---|---|---|---|---|---|
| tier-1 unit | N | N | 0 | 0 | 0 |
| tier-2 mock_app | N | N | 0 | 0 | 0 |
| tier-3 live_gui | N | N | M | 0 | K |
| tier-H headless | N | N | 0 | 0 | 0 |
| tier-P perf | N | N | 0 | 0 | 0 |

## Before vs. After

| Symptom | Before | After | Resolved By |
|---|---|---|---|
| test_rag_phase4_final_verify in batch | FAIL | PASS | FR2 (tmp_path_factory) |
| test_rag_phase4_stress in batch | FLAKY | PASS | FR3 (io_pool coalescing) |
| test_gui2_set_value_hook_works | FAIL | PASS | FR4 (set_value fix) |
| Per-test subprocess death | POISONS BATCH | RECOVERS | FR1 (autouse respawn) |
| Hardcoded paths in test files | 6 files | 0 files | FR2 (live_gui_workspace fixture) |
| io_pool race in _sync_rag_engine | YES | NO | FR3 (token + dirty flag) |

## Known Residual Failures

- `test_mma_concurrent_tracks_execution` (FAIL, separate code path, MMA engine state transitions)
- `test_mma_step_mode_approval_flow` (FAIL, same)
- `test_mma_complete_lifecycle` (FAIL, same)
- `test_z_negative_flows.py` x3 (FAIL, mock provider error path)
- `test_auto_switch_sim` (FAIL, workspace auto-switch logic)

These are documented as separate code paths, NOT test-isolation issues. They are deferred to follow-up tracks.

## Verification

```powershell
uv run .\scripts\run_tests_batched.py 2>&1 | Tee-Object -FilePath "tests\artifacts\post_track_batch_20260609.log"
```

Full log saved to `tests/artifacts/post_track_batch_20260609.log`.

## Conclusion

The 4 upcoming tracks (qwen_llama_grok, data_oriented_error_handling, data_structure_strengthening, mcp_architecture_refactor) can start from a clean baseline. The "test regression nightmare" is killed for the categories the user identified: state pollution, path hygiene, and io_pool race.
````

- [ ] **Step 7.2.2: Commit the report**

```powershell
cd C:\projects\manual_slop; git add docs/reports/test_bed_health_20260609.md tests/artifacts/post_track_batch_20260609.log
git commit -m "docs(report): test_bed_health_20260609 - post-track batch status"
$h = git log -1 --format='%H'
git notes add -m "Captures the post-track batch state. All 3 root causes of test regression churn (state pollution, path hygiene, io_pool race) are fixed. The 4 upcoming tracks can start from a clean baseline." $h
```

### Phase 7 verification

- [ ] **Step 7.V.1: Tier-1, tier-2, tier-3 batch results captured in the report**
- [ ] **Step 7.V.2: 0 new failures vs. baseline (Phase 0 capture)**
- [ ] **Step 7.V.3: At least 3 previously-failing tests now pass in batch** (the "after" row of the table)

---

## Phase 8: Docs + extension of `check_test_toml_paths.py`

Focus: Update the existing audit script to flag the hardcoded-path anti-pattern, and refresh the testing guide.

### Task 8.1: Extend `scripts/check_test_toml_paths.py` to flag `Path("tests/artifacts/")` and `Path("C:/projects/")`

**Files:**
- Modify: `scripts/check_test_toml_paths.py`

- [ ] **Step 8.1.1: Read the existing script**

Use `manual-slop_get_file_summary` on the script. Identify the regex/pattern matching logic.

- [ ] **Step 8.1.2: Add the new patterns**

Add to the script's pattern list:

- `r'Path\(["\']tests/artifacts/'`
- `r'Path\(["\']C:[/\\]+projects'`

These patterns match test files that hardcode the workspace path or the user's project root. Both are prefix matches, so a path below the workspace directory, such as `Path("tests/artifacts/live_gui_workspace")`, is flagged as well.

- [ ] **Step 8.1.3: Run the audit to verify it flags the right files**

```powershell
cd C:\projects\manual_slop; uv run python scripts/check_test_toml_paths.py --strict
```

Expected: 0 violations (the 6 files were updated in Phase 3).
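To make the behaviour of the two patterns in Step 8.1.2 concrete, here is a small standalone sketch. It does not reproduce the real `check_test_toml_paths.py`, whose structure is not shown in this plan; `HARDCODED_PATH_PATTERNS` and `find_violations` are hypothetical names. The first pattern is written as a prefix match on `tests/artifacts/` so that paths below the workspace directory are caught too. The code uses the 1-space indent from the plan's style baseline.

```python
import re

# Hypothetical pattern list: workspace-relative and project-root hardcoded paths.
HARDCODED_PATH_PATTERNS = [
 re.compile(r'Path\(["\']tests/artifacts/'),
 re.compile(r'Path\(["\']C:[/\\]+projects'),
]


def find_violations(source):
 """Return (line_number, stripped_line) pairs matching any pattern."""
 hits = []
 for lineno, line in enumerate(source.splitlines(), start=1):
  if any(pattern.search(line) for pattern in HARDCODED_PATH_PATTERNS):
   hits.append((lineno, line.strip()))
 return hits
```

One limitation worth knowing: both patterns expect the quote to follow `Path(` directly, so a raw-string literal such as `Path(r"C:\projects")` would not be flagged by this sketch.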
- [ ] **Step 8.1.4: Write a TDD test for the audit**

Create `tests/test_check_test_toml_paths.py`:

```python
import subprocess
import sys


def test_audit_flags_hardcoded_workspace_path(tmp_path):
 bad_file = tmp_path / "test_bad.py"
 bad_file.write_text('workspace = Path("tests/artifacts/live_gui_workspace")\n')
 # Run the audit on tmp_path
 result = subprocess.run(
  [sys.executable, "scripts/check_test_toml_paths.py", "--strict", str(tmp_path)],
  capture_output=True, text=True
 )
 assert result.returncode != 0
 assert "test_bad.py" in result.stdout


def test_audit_passes_clean_file(tmp_path):
 good_file = tmp_path / "test_good.py"
 good_file.write_text("workspace = live_gui_workspace\n")
 result = subprocess.run(
  [sys.executable, "scripts/check_test_toml_paths.py", "--strict", str(tmp_path)],
  capture_output=True, text=True
 )
 assert result.returncode == 0
```

- [ ] **Step 8.1.5: Run the test to confirm it PASSES**

```powershell
cd C:\projects\manual_slop; uv run pytest tests/test_check_test_toml_paths.py -v --timeout=15
```

Expected: 2 tests PASS.

- [ ] **Step 8.1.6: Commit**

```powershell
cd C:\projects\manual_slop; git add scripts/check_test_toml_paths.py tests/test_check_test_toml_paths.py
git commit -m "feat(audit): flag hardcoded workspace and project-root paths in tests"
$h = git log -1 --format='%H'
git notes add -m "Extends check_test_toml_paths.py to also flag Path('tests/artifacts/...') and Path('C:/projects/...') in test files. These are the two anti-patterns that the 6 test files in Phase 3 used to violate. Two new tests verify the audit." $h
```

### Task 8.2: Update `docs/guide_testing.md` to document the new fixtures

**Files:**
- Modify: `docs/guide_testing.md`

- [ ] **Step 8.2.1: Read the existing guide**

Use `manual-slop_get_file_summary` to map the structure.

- [ ] **Step 8.2.2: Add a new section "8. Per-test subprocess resilience"**

Document:

- The `_LiveGuiHandle` class.
- The `_check_live_gui_health` autouse fixture.
- The `live_gui_workspace` fixture.
- The `clean_baseline` marker.
~50 lines of new content.

- [ ] **Step 8.2.3: Commit**

```powershell
cd C:\projects\manual_slop; git add docs/guide_testing.md
git commit -m "docs(testing): document live_gui handle + workspace fixture + clean_baseline marker"
$h = git log -1 --format='%H'
git notes add -m "Adds a new section to guide_testing.md documenting the _LiveGuiHandle, _check_live_gui_health, live_gui_workspace, and clean_baseline marker. The section is placed in §8 (after the 7 conftest fixtures in §7)." $h
```

### Phase 8 verification

- [ ] **Step 8.V.1: `check_test_toml_paths.py --strict` passes with 0 violations**
- [ ] **Step 8.V.2: 2 new tests for the audit pass**
- [ ] **Step 8.V.3: `docs/guide_testing.md` updated**

---

## Final Verification

- [ ] **All 5 FRs (FR1-FR5) implemented with TDD tests**
- [ ] **All 4 audits committed in Phase 1**
- [ ] **Test bed health report written and committed**
- [ ] **`docs/guide_testing.md` updated**
- [ ] **No new failures in tier-1 / tier-2 / tier-3 batch**
- [ ] **At least 3 previously-failing tests now pass in batch**

The track is done when the user reviews the test_bed_health report and confirms that the 4 upcoming tracks (qwen_llama_grok, data_oriented_error_handling, data_structure_strengthening, mcp_architecture_refactor) can start from a clean baseline.

---

## Execution Constraints

- **Tier 2 supervision required for:** Phase 1 (audit review), Phase 3 (conftest refactor), Phase 4 (io_pool race fix). These are the highest-risk phases.
- **Per-task atomic commits.** One commit per task, never batch.
- **Commit message format:** `<type>(<scope>): <description>`, as in `fix(api_hooks): ...` and `feat(test): ...` above.
- **Git note format:** 3-8 lines per commit.
- **Style baseline:** 1-space indent, no comments, type hints, CRLF on Windows.
- **TDD discipline:** Failing test first. No implementation before the red phase is confirmed.
- **No diagnostic noise in production.** All diagnostic stderr goes to `tests/artifacts/*.diag.log`, never to `src/*.py`. This follows the `AGENTS.md` "No Diagnostic Noise in Production" rule.
- **Deduction loop cap:** 2 test runs per investigation. If a test fails twice, read the code, predict the failure mode, instrument in one pass, then run a third time. If it still fails, escalate to the user.
- **Conftest corruption safety:** Before ANY edit to `tests/conftest.py`, run `git stash` so the edit starts from a clean, committed state. Avoid the `git add . && git commit --allow-empty` alternative while the out-of-scope workspace files from Step 0.2 are dirty, because `git add .` would stage them. If the edit fails, discard it with `git checkout -- tests/conftest.py`, then run `git stash pop` and re-investigate. The previous attempt at the conftest refactor was reverted due to corruption.