ed/manual_slop

ed 566cf08cb8 conductor(track): test_infrastructure_hardening_20260609 - spec to kill the test regression nightmare

2026-06-09 15:15:26 -04:00

Test Infrastructure Hardening — Implementation Plan

For Tier 3 workers: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Tier 2 supervision required: Phase 1, Phase 3 (the conftest refactor), and Phase 4 (the _sync_rag_engine race fix) MUST be supervised by a Tier 2 Tech Lead. These touch the session-scoped live_gui fixture and the controller's hot path; the prior attempt at the conftest refactor was reverted due to corruption (see docs/reports/rag_test_batch_failure_status_20260609_pm3.md).

Goal: Fix the 3 root causes of test regression churn (subprocess state pollution, filesystem path hygiene, io_pool race) + 2 related bugs (set_value hook, optional clean-baseline) so the 4 upcoming tracks (qwen_llama_grok, data_oriented_error_handling, data_structure_strengthening, mcp_architecture_refactor) start from a known green baseline.

Architecture: Each phase is self-contained, with TDD: failing test first, then minimum implementation, then verify pass, then commit. Per-task atomic commits. No batching.

Tech Stack: Python 3.11+, pytest, FastAPI/Uvicorn (live_gui), tmp_path_factory, threading.Lock.

Pre-Phase 0: Tier 2 checkpoint + dirty-state audit

Before starting Phase 1, the Tier 2 Tech Lead must:

Step 0.1: Read all referenced reports
- docs/reports/rag_test_batch_failure_status_20260609_pm3.md (filesystem hygiene findings)
- docs/reports/rag_work_final_20260609_pm.md (io_pool race, set_value hook)
- docs/reports/test_infra_hardening_foundation_20260608.md (foundation, 5 phases)
- docs/reports/batch_resilience_plan_20260608.md (4 batch-resilience solutions)
- conductor/edit_workflow.md (surgical tool guidance)
Step 0.2: Verify the dirty working tree is safe
- Working tree currently has uncommitted changes in config.toml, manualslop_layout.ini, project_history.toml, src/warmup.py. These are user workspace artifacts, NOT test infrastructure.
- Do NOT commit these. They are out of scope.
- Use git stash --keep-index or commit them separately if the user requests.
Step 0.3: Run the current batch baseline to capture "before" state
```
cd C:\projects\manual_slop; uv run .\scripts\run_tests_batched.py 2>&1 | Tee-Object -FilePath "tests\artifacts\batch_baseline_20260609.log" | Select-Object -Last 50
```
Expected: tier-1 + tier-2 pass; tier-3 has the documented failures (RAG dim-mismatch, set_value hook, RAG phase4 final verify, RAG phase4 stress).

Phase 1: Audit (no code changes)

Focus: Catalog the existing state so Phases 2-7 have a data-grounded baseline.

Task 1.1: Enumerate `live_gui` test cross-file state dependencies

Files:

Read: tests/conftest.py:282-547 (the live_gui fixture)
Read: all 49+ test files that use live_gui

Step 1.1.1: Generate the live_gui test inventory

cd C:\projects\manual_slop; uv run python -c "
from pathlib import Path
import re
root = Path('tests')
files = sorted(root.glob('test_*.py'))
users = []
for f in files:
    text = f.read_text(encoding='utf-8')
    if 'live_gui' in text:
        users.append(f.name)
print(f'{len(users)} test files use live_gui:')
for u in users:
    print(f'  {u}')
"

Save output to conductor/tracks/test_infrastructure_hardening_20260609/audit/live_gui_users.txt.

Step 1.1.2: For each live_gui test file, grep for set_value calls and get_value calls

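# A foreach statement cannot feed a pipeline, and Write-Host output bypasses Tee-Object, so use ForEach-Object and plain string output.
# Skip the header line written by Step 1.1.1 and trim the two-space indent before building each path.
cd C:\projects\manual_slop; Get-Content conductor/tracks/test_infrastructure_hardening_20260609/audit/live_gui_users.txt | Select-Object -Skip 1 | ForEach-Object {
  $f = $_.Trim()
  "=== $f ==="
  Select-String -Path "tests\$f" -Pattern '(set_value|get_value|reset_session)' | Select-Object LineNumber, Line | Format-Table -AutoSize | Out-String
} | Tee-Object -FilePath "conductor/tracks/test_infrastructure_hardening_20260609/audit/live_gui_state_io.txt"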

Save output to the audit directory. This shows which tests read state set by other tests.

Step 1.1.3: Categorize each test as "self-contained" or "cross-test-dependent"

Self-contained = no set_value calls OR all set_value calls are within the same test function. Cross-test-dependent = has get_value calls that depend on a prior test's set_value.

Save the categorization to conductor/tracks/test_infrastructure_hardening_20260609/audit/live_gui_dependencies.json:
```
{
  "self_contained": ["test_a.py", "test_b.py", ...],
  "cross_test_dependent": ["test_x.py::test_y", ...]
}
```
Step 1.1.4: Commit the audit
```
cd C:\projects\manual_slop; git add conductor/tracks/test_infrastructure_hardening_20260609/audit/
git commit -m "conductor(audit): catalog live_gui test cross-file state dependencies"
```

Task 1.2: Document the current `live_gui_workspace` path-hygiene state

Step 1.2.1: Find all hardcoded references to tests/artifacts/live_gui_workspace

cd C:\projects\manual_slop; rg -n "tests/artifacts/live_gui_workspace" tests/ --type py | Tee-Object -FilePath "conductor/tracks/test_infrastructure_hardening_20260609/audit/hardcoded_paths.txt"

Expect 7+ matches per the spec's "Files affected" list.

Step 1.2.2: Find all Path("C:/projects/") or Path("C:\\projects\\") references in test files

cd C:\projects\manual_slop; rg -n 'Path\("C:[/\\]+projects' tests/ --type py | Tee-Object -FilePath "conductor/tracks/test_infrastructure_hardening_20260609/audit/hardcoded_project_root.txt"

Expect 0 matches: the spec says there are none in production code, so confirm the same holds for tests and record any hits in the audit file.

Step 1.2.3: Commit the audit

cd C:\projects\manual_slop; git add conductor/tracks/test_infrastructure_hardening_20260609/audit/hardcoded_*.txt
git commit -m "conductor(audit): document hardcoded workspace paths in test suite"

Task 1.3: Document the current `_sync_rag_engine` race

Step 1.3.1: Read src/app_controller.py:_sync_rag_engine and its callers

Use manual-slop_py_get_definition to read _sync_rag_engine. Identify:
- The set of setters that trigger sync (e.g., rag_collection_name, files, rag_enabled, rag_source, rag_emb_provider).
- The submit-to-io_pool call site.
- Whether there's any existing coalescing/debouncing.

Step 1.3.2: Write the audit to conductor/tracks/test_infrastructure_hardening_20260609/audit/sync_rag_race.md

Format:

# _sync_rag_engine Race Audit

## Setters that trigger sync
- `set_rag_collection_name` (src/app_controller.py:N)
- `set_rag_enabled` (src/app_controller.py:N)
- `set_files` (src/app_controller.py:N)
- ...

## Submit pattern
[paste 5-10 lines of the submit call]

## Coalescing mechanism
[None / Token-based / Lock-based / etc.]

## Race scenario
1. Test fires setter A → submit task T1
2. Test fires setter B (50ms later) → submit task T2
3. T1 starts on io_pool thread, starts constructing RAGEngine
4. T2 starts on a different io_pool thread, starts constructing RAGEngine
5. T1 finishes first, sets self.rag_engine = engine_A
6. T2 finishes, sets self.rag_engine = engine_B
7. Test queries self.rag_engine → engine_B (last writer wins)
8. engine_B may not have indexed the file from setter A → test fails

Step 1.3.3: Commit the audit

cd C:\projects\manual_slop; git add conductor/tracks/test_infrastructure_hardening_20260609/audit/sync_rag_race.md
git commit -m "conductor(audit): document _sync_rag_engine race in controller"

Task 1.4: Document the `set_value` hook for `ai_input`

Step 1.4.1: Read the /api/gui/set_value endpoint in src/api_hooks.py

Use manual-slop_py_get_definition to find the endpoint. Identify the parameter-to-handler mapping.

Step 1.4.2: Read src/gui_2.py:__setattr__ and the _UI_FLAG_DEFAULTS allowlist

Use manual-slop_py_get_definition to read both. Verify the allowlist is in place (from commit bcdc26d0).

Step 1.4.3: Test the failing case directly via the live_gui fixture

Write a diagnostic test (NOT yet committed) that:
1. Gets the live_gui fixture.
2. Calls client.set_value('ai_input', 'hello').
3. Waits 0.5s.
4. Calls client.get_value('ai_input').
5. Prints the result.
Run with:

cd C:\projects\manual_slop; uv run pytest -xvs --no-header tests/test_gui2_set_value_hook_works.py 2>&1 | Select-Object -Last 30

If the test fails, read the API hooks endpoint to find the missing branch.
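The fixed 0.5s wait in the diagnostic can blur the result: if the hook applies the value asynchronously, a short sleep may simply be too short on a loaded machine, which looks the same as a dropped assignment. Below is a minimal sketch of a polling alternative. It assumes only that the client exposes set_value and get_value as used above; `wait_for_value` and `FakeClient` are hypothetical names for illustration and are not part of the repo.

```python
import time
from typing import Any, Callable


def wait_for_value(read: Callable[[], Any], expected: Any,
                   timeout: float = 2.0, interval: float = 0.05) -> Any:
    """Poll read() until it returns expected or timeout elapses; return the last value seen."""
    deadline = time.monotonic() + timeout
    value = read()
    while value != expected and deadline > time.monotonic():
        time.sleep(interval)
        value = read()
    return value


class FakeClient:
    """Stand-in for the real API client: a set value becomes visible after a delay."""

    def __init__(self, apply_delay: float) -> None:
        self._apply_delay = apply_delay
        self._pending: tuple[float, str] | None = None
        self._value = ""

    def set_value(self, field: str, value: str) -> str:
        # The field name is ignored here; the fake tracks a single value.
        self._pending = (time.monotonic() + self._apply_delay, value)
        return "queued"

    def get_value(self, field: str) -> str:
        if self._pending and time.monotonic() >= self._pending[0]:
            self._value = self._pending[1]
            self._pending = None
        return self._value


if __name__ == "__main__":
    client = FakeClient(apply_delay=0.2)
    client.set_value("ai_input", "hello")
    print(repr(wait_for_value(lambda: client.get_value("ai_input"), "hello")))
```

With polling, a value that arrives late still shows up, while a value that never arrives is reported as the last value read once the timeout expires. That separates "slow" from "dropped" in the audit.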

Step 1.4.4: Write the audit to conductor/tracks/test_infrastructure_hardening_20260609/audit/set_value_hook.md

Format:

# set_value('ai_input') Audit

## Endpoint code path
[paste the relevant 10-20 lines from /api/gui/set_value]

## Expected flow
1. POST /api/gui/set_value with {"field": "ai_input", "value": "hello"}
2. Endpoint calls controller.set_ai_input("hello") (or similar)
3. Controller sets self.ai_input = "hello"
4. Subsequent get_value('ai_input') returns "hello"

## Actual flow (from diagnostic)
1. POST returns 'queued'
2. Controller does NOT set self.ai_input
3. Subsequent get_value returns ''

## Root cause
[Identify the missing branch — likely the /api/gui/set_value endpoint has a hardcoded list of fields it handles, and 'ai_input' is not on the list, OR the controller's __setattr__ drops the assignment]

## Fix location
[src/api_hooks.py:N or src/gui_2.py:N]

Step 1.4.5: Commit the audit

cd C:\projects\manual_slop; git add conductor/tracks/test_infrastructure_hardening_20260609/audit/set_value_hook.md
git commit -m "conductor(audit): trace set_value('ai_input') flow to find routing bug"

Phase 1 verification

Step 1.V.1: All 7 audit files committed
- audit/live_gui_users.txt
- audit/live_gui_state_io.txt
- audit/live_gui_dependencies.json
- audit/hardcoded_paths.txt
- audit/hardcoded_project_root.txt
- audit/sync_rag_race.md
- audit/set_value_hook.md
Step 1.V.2: User reviews the audit
- Tier 2 Tech Lead presents the audit to the user.
- User approves before Phase 2 begins.

Phase 2: FR1 — Per-test subprocess health check + respawn

Focus: Add an autouse fixture that recovers the live_gui subprocess before each test, when degraded.

Task 2.1: Add a `_LiveGuiHandle` class with `ensure_alive()`

Files:

Modify: tests/conftest.py (add _LiveGuiHandle class BEFORE the live_gui fixture)
Step 2.1.1: Pre-edit checkpoint (Tier 2 supervised)
```
cd C:\projects\manual_slop; git stash push -m "wip before Phase 2"
```
(The current working tree has user workspace artifacts that should NOT be in this commit; stash them and re-apply after Phase 2's commit.)
Step 2.1.2: Read the existing live_gui fixture

Read tests/conftest.py:282-547 with manual-slop_get_file_slice. Note:
- The current live_gui fixture creates a subprocess at lines 412-450.
- The fixture's finally block (lines 516-547) calls kill_process_tree.
- The fixture yields (process, gui_script) (a tuple).

Step 2.1.3: Refactor live_gui to use a _LiveGuiHandle class

Insert a new class BEFORE the live_gui fixture (around line 280):

class _LiveGuiHandle:
    def __init__(self, gui_script: str, workspace: Path, log_path: Path) -> None:
        self._gui_script = gui_script
        self._workspace = workspace
        self._log_path = log_path
        self._process: subprocess.Popen | None = None
        self._lock = threading.Lock()
        self._respawn_count = 0
        self._spawn()

    def _spawn(self) -> None:
        # Existing fixture spawn logic, lifted from conftest.py:412-450
        # (use the actual spawn logic from the current fixture)
        ...

    def is_alive(self) -> bool:
        return self._process is not None and self._process.poll() is None

    def ensure_alive(self) -> None:
        with self._lock:
            if not self.is_alive():
                self._respawn_count += 1
                self._spawn()

    @property
    def process(self) -> subprocess.Popen:
        self.ensure_alive()
        assert self._process is not None
        return self._process

    @property
    def respawn_count(self) -> int:
        return self._respawn_count

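CRITICAL: 1-space indent. The listing above is shown with wider indentation; when inserting, use the exact 1-space pattern from tests/conftest.py. Do not introduce 4-space indent.

Use manual-slop_py_add_def to insert the class before the live_gui fixture, as stated in Step 2.1.3. After insertion, confirm the file still parses via ast.parse and check the indent width.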
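ast.parse on its own only proves the file is syntactically valid; it does not reject a consistently 4-space class. A minimal sketch of a check that covers both is shown below, assuming the goal is one space per nesting level for the inserted class. `check_class_indent` is a hypothetical helper, not an existing tool in the repo.

```python
import ast


def check_class_indent(source: str, class_name: str, unit: int = 1) -> bool:
    """Parse source, then confirm class_name's members and method bodies
    are indented by `unit` spaces per nesting level."""
    tree = ast.parse(source)  # raises SyntaxError if the edit corrupted the file
    for node in tree.body:
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            for member in node.body:
                if member.col_offset != node.col_offset + unit:
                    return False
                if isinstance(member, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    for stmt in member.body:
                        if stmt.col_offset != member.col_offset + unit:
                            return False
            return True
    raise LookupError(f"class {class_name!r} not found at module level")
```

It only inspects two nesting levels (class members and the top-level statements of each method), and a one-line `def f(): pass` would be reported as mis-indented, so treat a False result as a prompt to look at the diff rather than as proof of corruption.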

Step 2.1.4: Refactor the live_gui fixture to use the handle

Change the fixture from yielding a tuple (process, gui_script) to yielding a _LiveGuiHandle instance. Update the docstring.

Use manual-slop_set_file_slice to replace the fixture's body. Verify the indent.

Step 2.1.5: Update all 49 live_gui tests to use the new API

The current pattern is:

def test_x(live_gui):
    process, gui_script = live_gui

The new pattern is:

def test_x(live_gui):
    handle = live_gui
    process = handle.process

This is a sweep across all 49 test files. Use rg to find all process, gui_script = live_gui lines, then sed/Python to replace.

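cd C:\projects\manual_slop; uv run python -c "
from pathlib import Path
root = Path('tests')
old = 'process, gui_script = live_gui'
# Keep process bound so the rest of each test body still runs; a one-line
# statement avoids having to reproduce the indent of each file. Tests that also
# use gui_script need a manual follow-up, since the handle has no public gui_script.
new = 'handle = live_gui; process = handle.process'
count = 0
for f in root.glob('test_*.py'):
    text = f.read_text(encoding='utf-8')
    if old in text:
        f.write_text(text.replace(old, new), encoding='utf-8')
        count += 1
print(f'Updated {count} test files')
"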

This is a single in-place edit; verify with git diff --stat.

Step 2.1.6: Run a representative live_gui test to verify the refactor
```
cd C:\projects\manual_slop; uv run pytest tests/test_gui_startup_smoke.py -v --timeout=30
```
Expected: PASS. If FAIL, revert via git checkout tests/ and re-investigate.

Step 2.1.7: Commit the refactor (Tier 2 supervised)

cd C:\projects\manual_slop; git add tests/
git commit -m "refactor(test): wrap live_gui subprocess in _LiveGuiHandle class"
$h = git log -1 --format='%H'
git notes add -m "Refactor the session-scoped live_gui fixture to yield a _LiveGuiHandle instead of a (process, gui_script) tuple. The handle has ensure_alive() and respawn_count. All 49 dependent test files updated to consume the handle. Foundation for the autouse _check_live_gui_health fixture in Task 2.2." $h

Task 2.2: Add the autouse `_check_live_gui_health` fixture

Files:

Modify: tests/conftest.py (add the autouse fixture AFTER the live_gui fixture)

Step 2.2.1: Write a failing test (TDD red)

Create tests/test_live_gui_respawn.py:

import pytest
import time

def test_live_gui_respawn_after_kill(live_gui):
    handle = live_gui
    # Capture the Popen once: handle.process calls ensure_alive() on every
    # access, so re-reading it after kill() could return a respawned process.
    proc = handle.process
    initial_pid = proc.pid
    initial_respawn_count = handle.respawn_count
    proc.kill()
    proc.wait(timeout=5)
    assert not handle.is_alive()
    handle.ensure_alive()
    assert handle.is_alive()
    new_pid = handle.process.pid
    assert new_pid != initial_pid
    assert handle.respawn_count == initial_respawn_count + 1

def test_live_gui_no_respawn_on_clean(live_gui):
    handle = live_gui
    initial_count = handle.respawn_count
    handle.ensure_alive()
    assert handle.respawn_count == initial_count

def test_live_gui_health_check_fast_path(live_gui):
    handle = live_gui
    t0 = time.perf_counter()
    handle.ensure_alive()
    elapsed = time.perf_counter() - t0
    assert elapsed < 0.1, f"ensure_alive took {elapsed:.3f}s on a clean subprocess"
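
Because the process property calls ensure_alive() on every access, code that kills the subprocess should hold on to a single Popen object for both the kill and the wait. The toy sketch below shows that pattern in isolation. It assumes a child that is just a sleeping Python process; `DemoHandle` and `kill_and_respawn` are hypothetical names, and the real _spawn logic in conftest.py is not reproduced here.

```python
import subprocess
import sys
import threading


class DemoHandle:
    """Toy version of the handle: the child is just a sleeping Python process."""

    def __init__(self) -> None:
        self._process: subprocess.Popen | None = None
        self._lock = threading.Lock()
        self.respawn_count = 0
        self._spawn()

    def _spawn(self) -> None:
        self._process = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(60)"]
        )

    def is_alive(self) -> bool:
        return self._process is not None and self._process.poll() is None

    def ensure_alive(self) -> None:
        with self._lock:
            if not self.is_alive():
                self.respawn_count += 1
                self._spawn()

    @property
    def process(self) -> subprocess.Popen:
        self.ensure_alive()
        assert self._process is not None
        return self._process

    def close(self) -> None:
        # Clean up the child so the demo leaves nothing running.
        if self._process is not None and self._process.poll() is None:
            self._process.kill()
            self._process.wait(timeout=5)


def kill_and_respawn(handle: DemoHandle) -> tuple[int, int]:
    proc = handle.process      # capture once; each property access may respawn
    old_pid = proc.pid
    proc.kill()
    proc.wait(timeout=5)       # wait on the captured object, not handle.process
    assert not handle.is_alive()
    handle.ensure_alive()
    return old_pid, handle.process.pid
```

Calling `handle.process.wait(...)` after the kill would instead go through the property again, which can respawn the child and then block on the new, healthy process until the timeout.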

Step 2.2.2: Run the test to confirm it FAILS
```
cd C:\projects\manual_slop; uv run pytest tests/test_live_gui_respawn.py -v --timeout=30
```
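Expected: FAIL if run before Task 2.1 lands (the respawn_count attribute doesn't exist yet). Once Task 2.1 is committed, these three tests exercise the handle directly and should already pass, so treat this run as a check of the handle rather than a red step for the autouse fixture.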

Step 2.2.3: Add the autouse fixture to tests/conftest.py

Insert AFTER the live_gui fixture:

@pytest.fixture(autouse=True)
def _check_live_gui_health(request):
    # Do not take live_gui as a parameter: an autouse fixture that requests it
    # would start the GUI subprocess for every test in the suite.
    if "live_gui" in request.fixturenames:
        handle = request.getfixturevalue("live_gui")
        handle.ensure_alive()
    yield

Use manual-slop_py_add_def with anchor_type=after, anchor_symbol=live_gui.

Step 2.2.4: Run the test to confirm it PASSES

cd C:\projects\manual_slop; uv run pytest tests/test_live_gui_respawn.py -v --timeout=30

Expected: 3 tests PASS.

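Step 2.2.5: Run the full tier-3 live_gui batch to verify no regression
```
cd C:\projects\manual_slop; uv run pytest tests/ -k "live_gui" -v --timeout=30 2>&1 | Select-Object -Last 50
```
Expected: Most tests pass; the documented failures (RAG dim-mismatch, set_value, RAG phase4) still fail. NO new failures. The run omits -x so that it continues past the documented failures. Note that -k selects by test and file name rather than by fixture usage, so cross-check coverage against audit/live_gui_users.txt.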

Step 2.2.6: Commit (Tier 2 supervised)

cd C:\projects\manual_slop; git add tests/conftest.py tests/test_live_gui_respawn.py
git commit -m "feat(test): autouse _check_live_gui_health recovers from degraded subprocess"
$h = git log -1 --format='%H'
git notes add -m "Adds an autouse fixture that calls handle.ensure_alive() before each test that uses live_gui. If the subprocess is dead, it respawns. If alive, the check is <100ms. Three new tests in tests/test_live_gui_respawn.py verify the respawn, the no-op-on-clean path, and the performance budget." $h

Phase 2 verification

Step 2.V.1: 3 new tests in tests/test_live_gui_respawn.py pass
Step 2.V.2: No new regressions in tier-3 batch
Step 2.V.3: User reviews the autouse respawn behavior
- The per-test health check should add under 200ms per test (an actual respawn only happens when the subprocess is dead). Verify with the 49 tests in batch.
- User approves before Phase 3 begins.

Phase 3: FR2 — `live_gui_workspace` fixture + update 6 test files

Focus: Eliminate hardcoded Path("tests/artifacts/live_gui_workspace") from test files. Use tmp_path_factory.mktemp.

Tier 2 supervised for entire phase (the prior attempt at this refactor was reverted due to corruption; see docs/reports/rag_test_batch_failure_status_20260609_pm3.md).

Task 3.1: Refactor `live_gui` to use `tmp_path_factory.mktemp`

Files:

Modify: tests/conftest.py (the live_gui fixture's workspace creation)

Step 3.1.1: Pre-edit checkpoint

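cd C:\projects\manual_slop; git add .; git commit -m "wip: pre-Phase 3 checkpoint" --allow-empty

(Uses ; rather than &&, which Windows PowerShell 5.1 does not support. git add . stages everything, so confirm the user workspace artifacts from Step 0.2 are stashed first.)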

Step 3.1.2: Use manual-slop_set_file_slice to replace the workspace creation

Read tests/conftest.py:410-414 with manual-slop_get_file_slice. Note the EXACT text of the lines that create the workspace (the Path("tests/artifacts/live_gui_workspace") reference and the surrounding os.makedirs or mkdir calls).

Replace ONLY those lines with:
```
workspace = tmp_path_factory.mktemp("live_gui_workspace")
```
where tmp_path_factory is added to the fixture's parameters.

The fixture signature changes from:
```
def live_gui(request):
```
to:
```
def live_gui(request, tmp_path_factory):
```
CRITICAL — verify via ast.parse after the edit.

Use manual-slop_py_check_syntax tests/conftest.py to confirm syntax is valid.
Step 3.1.3: Verify the fixture still spawns the subprocess correctly
```
cd C:\projects\manual_slop; uv run pytest tests/test_gui_startup_smoke.py -v --timeout=30
```
Expected: PASS. If FAIL, the workspace path is being constructed wrong.
Step 3.1.4: Verify the new workspace is a tmp dir (not under project tree)

Add a debug print to the test, then run:
```
cd C:\projects\manual_slop; uv run pytest tests/test_gui_startup_smoke.py -v --timeout=30 -s 2>&1 | Select-String "workspace"
```
Expect: workspace is under C:\Users\...\AppData\Local\Temp\..., NOT C:\projects\manual_slop\tests\artifacts\....

Step 3.1.5: Commit (Tier 2 supervised)

cd C:\projects\manual_slop; git add tests/conftest.py
git commit -m "refactor(test): live_gui workspace via tmp_path_factory"
$h = git log -1 --format='%H'
git notes add -m "Replaces the hardcoded Path('tests/artifacts/live_gui_workspace') with tmp_path_factory.mktemp('live_gui_workspace'). The workspace now lives in pytest's tmp dir, not in the project tree. Foundation for exposing the workspace path as a separate fixture in Task 3.2." $h

Task 3.2: Expose `live_gui_workspace` as a separate fixture

Files:

Modify: tests/conftest.py (add a new fixture)

Step 3.2.1: Write a failing test (TDD red)

Create tests/test_live_gui_workspace_fixture.py:

from pathlib import Path

def test_live_gui_workspace_is_absolute(live_gui_workspace):
    assert live_gui_workspace.is_absolute()

def test_live_gui_workspace_unique_per_session(live_gui, live_gui_workspace):
    assert live_gui_workspace.exists()
    assert live_gui_workspace.is_dir()  # the workspace starts empty, so only check the directory itself

def test_live_gui_workspace_passed_to_test(live_gui_workspace):
    test_file = live_gui_workspace / "test_file.txt"
    test_file.write_text("hello")
    assert test_file.read_text() == "hello"

Step 3.2.2: Run the test to confirm it FAILS

cd C:\projects\manual_slop; uv run pytest tests/test_live_gui_workspace_fixture.py -v --timeout=30

Expected: FAIL (no live_gui_workspace fixture yet).

Step 3.2.3: Add the live_gui_workspace fixture to tests/conftest.py

Insert AFTER the live_gui fixture:
```
@pytest.fixture
def live_gui_workspace(live_gui) -> Path:
    handle = live_gui
    return handle._workspace  # type: ignore[attr-defined]
```
The handle has the workspace as _workspace (set in Task 2.1.3).

Use manual-slop_py_add_def with anchor_type=after, anchor_symbol=live_gui.

Step 3.2.4: Run the test to confirm it PASSES

cd C:\projects\manual_slop; uv run pytest tests/test_live_gui_workspace_fixture.py -v --timeout=30

Expected: 3 tests PASS.

Step 3.2.5: Commit

cd C:\projects\manual_slop; git add tests/conftest.py tests/test_live_gui_workspace_fixture.py
git commit -m "feat(test): expose live_gui_workspace as a separate fixture"
$h = git log -1 --format='%H'
git notes add -m "Adds the live_gui_workspace fixture that returns the absolute path to the live_gui subprocess's workspace. Tests that need to create files in the workspace should request this fixture instead of hardcoding Path('tests/artifacts/live_gui_workspace')." $h

Task 3.3: Update the 6 dependent test files

Files:

Modify: 6 test files that hardcode the workspace path
Step 3.3.1: Read each test file and identify the hardcoded reference

For each of the following locations (six references across the five files listed):
- tests/test_rag_phase4_final_verify.py:20
- tests/test_rag_phase4_stress.py:21
- tests/test_saved_presets_sim.py:14, 121
- tests/test_tool_presets_sim.py:13
- tests/test_visual_sim_gui_ux.py:79
Read the surrounding 5 lines with manual-slop_get_file_slice to understand the context. The pattern is:
```
workspace = Path("tests/artifacts/live_gui_workspace")
workspace.mkdir(parents=True, exist_ok=True)
# ... use workspace
```
Step 3.3.2: For each test file, refactor to use the fixture

For each file, do this surgical edit:
1. Add live_gui_workspace to the test function's parameter list.
2. Replace Path("tests/artifacts/live_gui_workspace") with live_gui_workspace.
3. Remove the mkdir call (the fixture creates the dir).
4. Use live_gui_workspace.mkdir(parents=True, exist_ok=True) ONLY if subsequent code needs the dir to exist before the fixture's init (rare).
Use manual-slop_edit_file for each replacement. One file at a time. Verify after each.

Step 3.3.3: Run each updated test file to verify the refactor

For each updated file (the commands below cover the five files listed in Step 3.3.1), run:

cd C:\projects\manual_slop; uv run pytest tests/test_rag_phase4_final_verify.py -v --timeout=60
uv run pytest tests/test_rag_phase4_stress.py -v --timeout=60
uv run pytest tests/test_saved_presets_sim.py -v --timeout=60
uv run pytest tests/test_tool_presets_sim.py -v --timeout=60
uv run pytest tests/test_visual_sim_gui_ux.py -v --timeout=60

Expected: Each file passes in isolation. If any fails, the refactor broke something — investigate.

Step 3.3.4: Run the same files in batch to verify the BATCH failure is fixed
```
cd C:\projects\manual_slop; uv run pytest tests/test_extended_sims.py tests/test_rag_phase4_final_verify.py -v --timeout=120
```
Expected: The RAG test PASSES after the 4 sims. This is the primary symptom the user wanted fixed.

Step 3.3.5: Commit (Tier 2 supervised)

cd C:\projects\manual_slop; git add tests/
git commit -m "refactor(test): 6 test files use live_gui_workspace fixture instead of hardcoded path"
$h = git log -1 --format='%H'
git notes add -m "The 6 test files that hardcoded Path('tests/artifacts/live_gui_workspace') now request the live_gui_workspace fixture, which yields the absolute path. The RAG test passes in batch (after 4 sims) for the first time, because the workspace path is now absolute and CWD-independent." $h

Phase 3 verification

Step 3.V.1: 6 test files updated and pass in isolation
Step 3.V.2: RAG test passes in batch (after 4 sims) — the primary goal
Step 3.V.3: tests/test_live_gui_workspace_fixture.py 3 tests pass
Step 3.V.4: User reviews the RAG test passing in batch
- This is the "kill the nightmare" moment. User confirms before Phase 4.

Phase 4: FR3 — Coalesce `_sync_rag_engine` calls

Focus: Eliminate the io_pool race in app_controller._sync_rag_engine so multiple setters in quick succession produce one sync, not N parallel syncs.

Tier 2 supervised for entire phase. This touches the controller's hot path.

Task 4.1: Add a token-based coalescing mechanism

Files:

Modify: src/app_controller.py (the _sync_rag_engine method and the setters that trigger it)
Step 4.1.1: Read the existing _sync_rag_engine and the setters

Use manual-slop_py_get_definition on AppController._sync_rag_engine. Identify:
- The exact submit-to-io_pool call.
- The setters that call _sync_rag_engine (search for _sync_rag_engine usages).

Step 4.1.2: Add the coalescing state to AppController.__init__

Add to AppController.__init__ (use manual-slop_py_set_var_declaration):

self._rag_sync_token: int = 0
self._rag_sync_dirty: bool = False
self._rag_sync_lock: threading.Lock = threading.Lock()

Step 4.1.3: Write a failing test (TDD red)

Create tests/test_sync_rag_engine_coalescing.py:

from unittest.mock import patch, MagicMock
from src.app_controller import AppController

def test_sync_rag_engine_coalesces_five_setters():
    # Construct a minimal AppController (use existing fixture if available)
    # Patch the io_pool to count sync submissions
    with patch("src.app_controller.AppController._io_pool") as mock_pool:
        ctrl = AppController(...)
        for i in range(5):
            ctrl.set_rag_collection_name(f"name_{i}")
        # Assert: mock_pool.submit was called 0 times (or 1 time, with 5 setters coalesced)
        ...

def test_sync_rag_engine_rerun_on_token_change():
    ...

def test_sync_rag_engine_idempotent_no_changes():
    ...

Note: This test may require the existing test fixture for AppController. If no such fixture exists, use a minimal one (construct the controller with a tmp_path, mock the heavy dependencies).

Step 4.1.4: Run the test to confirm it FAILS

cd C:\projects\manual_slop; uv run pytest tests/test_sync_rag_engine_coalescing.py -v --timeout=30

Expected: FAIL (no coalescing yet).

Step 4.1.5: Refactor _sync_rag_engine to use the token + dirty flag

Use manual-slop_py_update_definition to replace _sync_rag_engine:

def _sync_rag_engine(self) -> None:
    with self._rag_sync_lock:
        self._rag_sync_token += 1
        self._rag_sync_dirty = True
        token = self._rag_sync_token
    self._io_pool.submit(self._do_rag_sync, token)

def _do_rag_sync(self, token: int) -> None:
    while True:
        with self._rag_sync_lock:
            if token != self._rag_sync_token:
                return  # a newer sync will pick up our changes
            self._rag_sync_dirty = False
        # Build the engine, set self.rag_engine
        ...
        with self._rag_sync_lock:
            if not self._rag_sync_dirty:
                return
            token = self._rag_sync_token
            self._rag_sync_dirty = False

The exact body of _do_rag_sync should be the existing body of _sync_rag_engine (renamed). The if not dirty: return check at the end ensures we only loop when a NEW setter has fired.

CRITICAL — thread safety. The lock protects token and dirty. The body of the sync runs WITHOUT the lock (to avoid blocking other setters).
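
One point that may be worth checking under Tier 2 review: in the listing above every setter still submits its own task, so if the io_pool has more than one worker, a second setter that fires while the first build is mid-flight could start a second build in parallel. A minimal sketch of a single-flight variant follows. It assumes an explicit running flag in addition to the token; `CoalescingSync`, `request`, and the `build` callback are hypothetical names, and this illustrates the technique rather than the controller's actual code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class CoalescingSync:
    """Single-flight coalescer: at most one worker is in flight, and it
    re-runs the build if more requests arrived while it was working."""

    def __init__(self, pool: ThreadPoolExecutor, build: Callable[[int], None]) -> None:
        self._pool = pool
        self._build = build
        self._lock = threading.Lock()
        self._token = 0          # bumped by every request
        self._running = False    # True while a worker is scheduled or executing
        self.submits = 0         # for observation only

    def request(self) -> None:
        with self._lock:
            self._token += 1
            if self._running:
                return           # the in-flight worker will notice the new token
            self._running = True
            self.submits += 1
        self._pool.submit(self._worker)

    def _worker(self) -> None:
        while True:
            with self._lock:
                token = self._token
            try:
                self._build(token)   # runs without the lock held
            except Exception:
                # Release the flag so a later request can schedule a new worker.
                with self._lock:
                    self._running = False
                raise
            with self._lock:
                if token == self._token:
                    self._running = False
                    return
```

The design choice is that only one thread ever runs the build, so the final engine always reflects the newest token the worker observed. Setters stay cheap because they only touch the counter and the flag under the lock.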

Step 4.1.6: Run the test to confirm it PASSES

cd C:\projects\manual_slop; uv run pytest tests/test_sync_rag_engine_coalescing.py -v --timeout=30

Expected: 3 tests PASS.

Step 4.1.7: Run the RAG test in batch to verify the race is fixed
```
cd C:\projects\manual_slop; uv run pytest tests/test_extended_sims.py tests/test_rag_phase4_final_verify.py tests/test_rag_phase4_stress.py -v --timeout=120
```
Expected: All 3 pass. The RAG stress test was previously non-deterministic; the coalescing makes it deterministic.

Step 4.1.8: Commit (Tier 2 supervised)

cd C:\projects\manual_slop; git add src/app_controller.py tests/test_sync_rag_engine_coalescing.py
git commit -m "fix(rag): coalesce _sync_rag_engine calls via token + dirty flag"
$h = git log -1 --format='%H'
git notes add -m "Replaces the immediate-submit-to-io_pool pattern with a token + dirty flag. Multiple setters in quick succession produce one sync, not N parallel syncs. The RAG stress test, which was non-deterministic, is now deterministic. The lock is held only for token/dirty access; the sync body runs lock-free to avoid blocking other setters." $h

Phase 4 verification

Step 4.V.1: 3 new tests in tests/test_sync_rag_engine_coalescing.py pass
Step 4.V.2: RAG stress test passes in batch (no longer non-deterministic)
Step 4.V.3: No regressions in tier-2 mock_app batch
Step 4.V.4: User reviews the io_pool race fix
- User confirms before Phase 5.

Phase 5: FR4 — Fix `set_value` hook for `ai_input`

Focus: Find the missing branch in /api/gui/set_value that causes set_value('ai_input', ...) to silently drop the assignment.

Task 5.1: Reproduce the failure with a minimal test

Step 5.1.1: Read the test that's currently failing

Use manual-slop_get_file_slice to read tests/test_gui2_set_value_hook_works.py:1-50.
Step 5.1.2: Run the test to confirm the failure
```
cd C:\projects\manual_slop; uv run pytest tests/test_gui2_set_value_hook_works.py -v --timeout=30
```
Expected: FAIL. The failure mode is set_value returns 'queued' but get_value('ai_input') returns ''.
Step 5.1.3: Trace the flow with diagnostic prints

The user's previous attempt at this diagnostic was rejected as "diagnostic noise in production." Use a temporary diagnostic file instead:
- Create tests/artifacts/diag_set_value.py (gitignored).
- Add prints to trace the flow.
- Run the test with pytest -s to see the prints.
- Once the root cause is identified, DELETE tests/artifacts/diag_set_value.py and apply the real fix.

Task 5.2: Apply the fix

Files (TBD based on Task 5.1 findings):

Likely: src/api_hooks.py (the /api/gui/set_value endpoint)
Possibly: src/gui_2.py (__setattr__ or _UI_FLAG_DEFAULTS allowlist)
Step 5.2.1: Apply the surgical fix

Use manual-slop_edit_file or manual-slop_py_update_definition as appropriate.

Step 5.2.2: Run the test to verify the fix

cd C:\projects\manual_slop; uv run pytest tests/test_gui2_set_value_hook_works.py -v --timeout=30

Expected: PASS.

Step 5.2.3: Commit (Tier 2 supervised)

cd C:\projects\manual_slop; git add src/
git commit -m "fix(api_hooks): set_value('ai_input') actually mutates controller state"
$h = git log -1 --format='%H'
git notes add -m "Identifies the missing branch in the /api/gui/set_value endpoint that caused ai_input to silently drop. The fix is consistent with the _UI_FLAG_DEFAULTS allowlist pattern from bcdc26d0." $h

Phase 5 verification

Step 5.V.1: tests/test_gui2_set_value_hook_works.py passes in batch
Step 5.V.2: No regressions in tier-3 batch

Phase 6: FR5 — Optional `clean_baseline` marker

Focus: Add a marker that tests can opt into for a clean controller state.

Task 6.1: Add the marker and the autouse fixture

Files:

Modify: tests/conftest.py
Modify: pyproject.toml (add the marker to [tool.pytest.ini_options].markers)

Step 6.1.1: Add the marker to pyproject.toml

Read pyproject.toml and find [tool.pytest.ini_options]. Add:

"clean_baseline: mark a test as requiring a clean controller state at start. The autouse _reset_clean_baseline fixture will call /api/reset_session before the test."

to the existing markers list.

Step 6.1.2: Write a failing test (TDD red)

Create tests/test_clean_baseline_marker.py:

import pytest

# NOTE: handle.api_client is not defined in the Task 2.1 _LiveGuiHandle listing
# and must exist for these tests to run.

@pytest.mark.clean_baseline
def test_clean_baseline_ai_input_is_empty(live_gui):
    # Setup only: leave ai_input polluted for the next test. The autouse reset
    # runs BEFORE each marked test body, so there is nothing to assert here.
    handle = live_gui
    client = handle.api_client
    client.set_value("ai_input", "polluted value")

@pytest.mark.clean_baseline
def test_clean_baseline_resets_ai_input_at_start(live_gui):
    # The PREVIOUS test set ai_input to "polluted value". If clean_baseline
    # worked, the autouse reset has already cleared it before this body runs.
    handle = live_gui
    value = handle.api_client.get_value("ai_input")
    assert value == "", f"Expected empty ai_input at start of clean_baseline test, got {value!r}"

Step 6.1.3: Run the test to confirm it FAILS

cd C:\projects\manual_slop; uv run pytest tests/test_clean_baseline_marker.py -v --timeout=30

Expected: FAIL (no clean_baseline autouse yet).

Step 6.1.4: Add the autouse fixture to tests/conftest.py. Insert AFTER the _check_live_gui_health fixture:

@pytest.fixture(autouse=True)
def _reset_clean_baseline(request):
    if request.node.get_closest_marker("clean_baseline"):
        # Resolve live_gui lazily so that unmarked tests do not start the GUI subprocess.
        handle = request.getfixturevalue("live_gui")
        handle.api_client.reset_session()  # existing endpoint
    yield

Use manual-slop_py_add_def with anchor_type=after, anchor_symbol=_check_live_gui_health.

Step 6.1.5: Run the test to confirm it PASSES

cd C:\projects\manual_slop; uv run pytest tests/test_clean_baseline_marker.py -v --timeout=30

Expected: 2 tests PASS.

Step 6.1.6: Commit

cd C:\projects\manual_slop; git add tests/conftest.py tests/test_clean_baseline_marker.py pyproject.toml
git commit -m "feat(test): clean_baseline marker resets controller state before test"
$h = git log -1 --format='%H'
git notes add -m "Adds an opt-in clean_baseline marker. Tests marked with @pytest.mark.clean_baseline get a fresh controller state via the existing /api/reset_session endpoint before they start. Two new tests verify the marker works." $h

Phase 6 verification

Step 6.V.1: 2 new tests pass
Step 6.V.2: User reviews the marker API
- User confirms before Phase 7.

Phase 7: FR6 — Run full batch + produce test_bed_health report

Focus: Capture the post-track "after" state. Document what's green, what's red, and what's expected to remain red.

Task 7.1: Run the full batched suite

Step 7.1.1: Run tier-1 (unit tests)

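# -Wait stops Select-Object from ending the pipeline after 200 lines, so the run finishes and the log file is complete.
cd C:\projects\manual_slop; uv run .\scripts\run_tests_batched.py 2>&1 | Tee-Object -FilePath "tests\artifacts\post_track_batch_20260609.log" | Select-Object -First 200 -Wait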

Expected: all tier-1 batches pass.

Step 7.1.2: Run tier-2 (mock_app tests). Same command, but capture the tier-2 portion.
Step 7.1.3: Run tier-3 (live_gui tests). Same command, but capture the tier-3 portion. Note: This is the big one; it may take 10+ minutes.
Step 7.1.4: Summarize pass/fail. From the captured log, extract:
- Total tests run.
- Tests passed.
- Tests failed (with file:line and error message).
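The extraction in Step 7.1.4 could be scripted roughly as follows. This is a minimal sketch that assumes the batched runner forwards pytest's default output unchanged, that is, short-summary lines of the form `FAILED <nodeid> - <message>` and one final count line per batch; the real format of the log may differ. `summarize_log` is a hypothetical helper, not part of the repository, and it reports node ids rather than file:line.

```python
import re
from collections import Counter

# Assumed line shapes (pytest defaults); adjust if the batched runner reformats output.
FAILED_RE = re.compile(r"^FAILED (?P<nodeid>\S+)(?: - (?P<msg>.*))?$")
COUNT_RE = re.compile(r"(\d+) (passed|failed)\b")

def summarize_log(text: str) -> dict:
    """Sum pass/fail counts across batches and collect failing node ids."""
    totals = Counter()
    failures = []
    for raw in text.splitlines():
        line = raw.strip()
        match = FAILED_RE.match(line)
        if match:
            failures.append((match.group("nodeid"), match.group("msg") or ""))
            continue
        # Final count lines look like "=== 1 failed, 2 passed in 0.50s ===".
        if line.startswith("=") and line.endswith("="):
            for count, kind in COUNT_RE.findall(line):
                totals[kind] += int(count)
    return {
        "passed": totals["passed"],
        "failed": totals["failed"],
        "failures": failures,
    }

# Illustrative input only, not real output from this project.
sample_log = (
    "FAILED tests/test_a.py::test_x - AssertionError: boom\n"
    "=========== 1 failed, 2 passed in 0.50s ===========\n"
    "=========== 3 passed in 0.10s ===========\n"
)
summary = summarize_log(sample_log)
print(summary["passed"], summary["failed"], summary["failures"])
```

Summing the count lines across batches gives the totals for the report table, while the `FAILED` lines supply the per-test detail.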

Task 7.2: Produce the test_bed_health report

Step 7.2.1: Write docs/reports/test_bed_health_20260609.md. Template:

# Test Bed Health Report (2026-06-09)

**Track:** test_infrastructure_hardening_20260609
**Date:** 2026-06-09
**Status:** [GREEN / YELLOW / RED]

## Summary

| Tier | Tests | Pass | Fail | New Failures | Resolved |
|---|---|---|---|---|---|
| tier-1 unit | N | N | 0 | 0 | 0 |
| tier-2 mock_app | N | N | 0 | 0 | 0 |
| tier-3 live_gui | N | N | M | 0 | K |
| tier-H headless | N | N | 0 | 0 | 0 |
| tier-P perf | N | N | 0 | 0 | 0 |

## Before vs. After

| Symptom | Before | After | Resolved By |
|---|---|---|---|
| test_rag_phase4_final_verify in batch | FAIL | PASS | FR2 (tmp_path_factory) |
| test_rag_phase4_stress in batch | FLAKY | PASS | FR3 (io_pool coalescing) |
| test_gui2_set_value_hook_works | FAIL | PASS | FR4 (set_value fix) |
| Per-test subprocess death | POISONS BATCH | RECOVERS | FR1 (autouse respawn) |
| Hardcoded paths in test files | 6 files | 0 files | FR2 (live_gui_workspace fixture) |
| io_pool race in _sync_rag_engine | YES | NO | FR3 (token + dirty flag) |

## Known Residual Failures

- `test_mma_concurrent_tracks_execution` (FAIL, separate code path, MMA engine state transitions)
- `test_mma_step_mode_approval_flow` (FAIL, same)
- `test_mma_complete_lifecycle` (FAIL, same)
- `test_z_negative_flows.py` x3 (FAIL, mock provider error path)
- `test_auto_switch_sim` (FAIL, workspace auto-switch logic)

These are documented as separate code paths, NOT test-isolation issues. They are deferred to follow-up tracks.

## Verification

```powershell
uv run .\scripts\run_tests_batched.py 2>&1 | Tee-Object -FilePath "tests\artifacts\post_track_batch_20260609.log"
```

Full log saved to tests/artifacts/post_track_batch_20260609.log.

## Conclusion

The 4 upcoming tracks (qwen_llama_grok, data_oriented_error_handling, data_structure_strengthening, mcp_architecture_refactor) can start from a clean baseline. The "test regression nightmare" is killed for the categories the user identified: state pollution, path hygiene, and io_pool race.

Step 7.2.2: Commit the report

cd C:\projects\manual_slop; git add docs/reports/test_bed_health_20260609.md tests/artifacts/post_track_batch_20260609.log
git commit -m "docs(report): test_bed_health_20260609 - post-track batch status"
$h = git log -1 --format='%H'
git notes add -m "Captures the post-track batch state. All 3 root causes of test regression churn (state pollution, path hygiene, io_pool race) are fixed. The 4 upcoming tracks can start from a clean baseline." $h

Phase 7 verification

Step 7.V.1: Tier-1, tier-2, tier-3 batch results captured in the report
Step 7.V.2: 0 new failures vs. baseline (Phase 0 capture)
Step 7.V.3: At least 3 previously-failing tests now pass in batch (the "after" row of the table)
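
The checks in Steps 7.V.2 and 7.V.3 amount to set arithmetic over failing test ids. The sketch below is one way to express them, assuming the Phase 0 baseline and the post-track run have each been reduced to a set of failing test names; `compare_runs` is a hypothetical helper, and the sets shown are illustrative values borrowed from the report template, not measured results.

```python
def compare_runs(baseline_failed: set[str], after_failed: set[str]) -> dict[str, list[str]]:
    """Classify failures relative to the baseline capture."""
    return {
        # Failing now but not in the baseline: must be empty for Step 7.V.2.
        "new": sorted(after_failed - baseline_failed),
        # Failing in the baseline but passing now: at least 3 for Step 7.V.3.
        "resolved": sorted(baseline_failed - after_failed),
        # Failing in both runs: candidates for "Known Residual Failures".
        "residual": sorted(after_failed & baseline_failed),
    }

# Illustrative sets only.
baseline = {
    "test_rag_phase4_final_verify",
    "test_rag_phase4_stress",
    "test_gui2_set_value_hook_works",
    "test_auto_switch_sim",
}
after = {"test_auto_switch_sim"}

report = compare_runs(baseline, after)
print(report)
```

The "New Failures" and "Resolved" columns of the summary table correspond to the lengths of the `new` and `resolved` lists.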

Phase 8: Docs + extension of `check_test_toml_paths.py`

Focus: Update the existing audit script to flag the hardcoded-path anti-pattern, and refresh the testing guide.

Task 8.1: Extend `scripts/check_test_toml_paths.py` to flag `Path("tests/artifacts/")` and `Path("C:/projects/")`

Files:

Modify: scripts/check_test_toml_paths.py
Step 8.1.1: Read the existing script. Use manual-slop_get_file_summary on the script. Identify the regex/pattern matching logic.
Step 8.1.2: Add the new patterns. Add to the script's pattern list:
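- r'Path\(["\']tests/artifacts/'
- r'Path\(["\']C:[/\\]+projects'
These patterns match test files that hardcode the workspace path or the user's project root. Both are prefix matches, so sub-paths such as tests/artifacts/live_gui_workspace (the case exercised in Step 8.1.4) are flagged as well; a pattern that required the closing quote right after tests/artifacts/ would miss them.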
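
To see what the two patterns catch, they can be run with Python's `re` module against a few sample lines. This is a minimal sketch: it uses prefix-anchored forms of the patterns so that sub-paths under tests/artifacts/ match, and `find_violations` and `HARDCODED_PATH_PATTERNS` are hypothetical names, not the audit script's actual internals.

```python
import re

# Prefix-anchored patterns: anything under tests/artifacts/ or C:/projects is flagged.
HARDCODED_PATH_PATTERNS = [
    re.compile(r'Path\(["\']tests/artifacts/'),
    re.compile(r'Path\(["\']C:[/\\]+projects'),
]

def find_violations(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for each line that hardcodes a flagged path."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in HARDCODED_PATH_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits

# Illustrative test-file content.
sample = '''\
workspace = Path("tests/artifacts/live_gui_workspace")
root = Path("C:/projects/manual_slop")
workspace = live_gui_workspace
'''

for lineno, line in find_violations(sample):
    print(f"line {lineno}: {line}")
```

Only the first two sample lines are reported; the third uses the fixture value and passes the audit.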
Step 8.1.3: Run the audit to verify it flags the right files
```
cd C:\projects\manual_slop; uv run python scripts/check_test_toml_paths.py --strict
```
Expected: 0 violations (the 6 files were updated in Phase 3).

Step 8.1.4: Write a TDD test for the audit. Create tests/test_check_test_toml_paths.py:

import subprocess
import sys

def test_audit_flags_hardcoded_workspace_path(tmp_path):
    bad_file = tmp_path / "test_bad.py"
    bad_file.write_text('workspace = Path("tests/artifacts/live_gui_workspace")\n')
    # Run the audit on tmp_path with the same interpreter that runs pytest.
    result = subprocess.run(
        [sys.executable, "scripts/check_test_toml_paths.py", "--strict", str(tmp_path)],
        capture_output=True, text=True
    )
    assert result.returncode != 0
    assert "test_bad.py" in result.stdout

def test_audit_passes_clean_file(tmp_path):
    good_file = tmp_path / "test_good.py"
    good_file.write_text("workspace = live_gui_workspace\n")
    result = subprocess.run(
        [sys.executable, "scripts/check_test_toml_paths.py", "--strict", str(tmp_path)],
        capture_output=True, text=True
    )
    assert result.returncode == 0

Step 8.1.5: Run the test to confirm it PASSES

cd C:\projects\manual_slop; uv run pytest tests/test_check_test_toml_paths.py -v --timeout=15

Expected: 2 tests PASS.

Step 8.1.6: Commit

cd C:\projects\manual_slop; git add scripts/check_test_toml_paths.py tests/test_check_test_toml_paths.py
git commit -m "feat(audit): flag hardcoded workspace and project-root paths in tests"
$h = git log -1 --format='%H'
git notes add -m "Extends check_test_toml_paths.py to also flag Path('tests/artifacts/...') and Path('C:/projects/...') in test files. These are the two anti-patterns that the 6 test files in Phase 3 used to violate. Two new tests verify the audit." $h

Task 8.2: Update `docs/guide_testing.md` to document the new fixtures

Files:

Modify: docs/guide_testing.md
Step 8.2.1: Read the existing guide. Use manual-slop_get_file_summary to map the structure.
Step 8.2.2: Add a new section "8. Per-test subprocess resilience". Document:
- The _LiveGuiHandle class.
- The _check_live_gui_health autouse fixture.
- The live_gui_workspace fixture.
- The clean_baseline marker.
~50 lines of new content.

Step 8.2.3: Commit

cd C:\projects\manual_slop; git add docs/guide_testing.md
git commit -m "docs(testing): document live_gui handle + workspace fixture + clean_baseline marker"
$h = git log -1 --format='%H'
git notes add -m "Adds a new section to guide_testing.md documenting the _LiveGuiHandle, _check_live_gui_health, live_gui_workspace, and clean_baseline marker. The section is placed in §8 (after the 7 conftest fixtures in §7)." $h

Phase 8 verification

Step 8.V.1: check_test_toml_paths.py --strict passes with 0 violations
Step 8.V.2: 2 new tests for the audit pass
Step 8.V.3: docs/guide_testing.md updated

Final Verification

All of FR1-FR5 implemented with TDD tests
All 4 audits committed in Phase 1
Test bed health report written and committed
docs/guide_testing.md updated
No new failures in tier-1 / tier-2 / tier-3 batch
At least 3 previously-failing tests now pass in batch

The track is done when the user reviews the test_bed_health report and confirms that the 4 upcoming tracks (qwen_llama_grok, data_oriented_error_handling, data_structure_strengthening, mcp_architecture_refactor) can start from a clean baseline.

Execution Constraints

Tier 2 supervision required for: Phase 1 (audit review), Phase 3 (conftest refactor), Phase 4 (io_pool race fix). These are the highest-risk phases.
Per-task atomic commits. One commit per task, never batch.
Commit message format: <type>(<scope>): <imperative description>.
Git note format: 3-8 lines per commit.
Style baseline: 1-space indent, no comments, type hints, CRLF on Windows.
TDD discipline: Failing test first. No implementation before the red phase is confirmed.
No diagnostic noise in production. All diagnostic stderr goes to tests/artifacts/*.diag.log, never to src/*.py. Per AGENTS.md "No Diagnostic Noise in Production" rule.
Deduction loop cap: 2 test runs per investigation. If a test fails twice, read the code, predict the failure mode, instrument in one pass, then run a third time. If it still fails, escalate to the user.
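Conftest corruption safety: Before ANY edit to tests/conftest.py, run git stash (or git add . && git commit --allow-empty) so the pre-edit state is recoverable. If the edit fails, first discard it with git checkout -- tests/conftest.py (otherwise the pop can be refused when the stash also touches that file), then git stash pop and re-investigate. The previous attempt at the conftest refactor was reverted due to corruption.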
