Track Specification: Test Infrastructure Hardening (2026-06-09)
Status: SPEC FOR APPROVAL. The user has asked for a single track to "kill the test regression nightmare" so the 4 upcoming tracks (qwen_llama_grok, data_oriented_error_handling, data_structure_strengthening, mcp_architecture_refactor) can land on a clean test bed.
Inheritance: This track absorbs and supersedes:
- docs/reports/test_infra_hardening_foundation_20260608.md (foundation, 5 phases proposed)
- docs/reports/batch_resilience_plan_20260608.md (4 solutions; Solution A + C recommended)
- docs/reports/rag_test_batch_failure_status_20260609_pm3.md (filesystem hygiene findings #1-5)
- docs/reports/rag_work_final_20260609_pm.md (remaining failures: io_pool race, set_value hook)
- The implicit "fix test in batch" goal that has been chasing the Tier 2 for 4+ days
Overview
The test suite has accumulated 49+ live_gui tests that share a single session-scoped subprocess. Recent regression hunts have surfaced 3 distinct failure modes that keep re-emerging under different masks:
- Subprocess state pollution: the 4 sims in test_extended_sims.py mutate controller state (current_provider, ui_* attrs, MMA workflows, RAG sync); subsequent tests in the same batch read dirty state.
- Filesystem hygiene: the live_gui fixture creates tests/artifacts/live_gui_workspace/ as a HARDCODED relative path; 6 test files re-derive the path independently; RAGEngine.index_file joins base_dir + file_path with base_dir possibly being a relative path, so indexing silently no-ops in batch (the root cause of the RAG test batch failure).
- io_pool race in _sync_rag_engine: multiple setters in quick succession submit parallel sync tasks, the last one to finish wins, and indexing is non-deterministic.
Each of these has been "fixed" in isolation (RAG dim-mismatch recursion, CWD fallback, embedding provider error surface, ini_content str/bytes sentinel, indent on _capture_workspace_profile) but the underlying architectural problems remain. The Tier 2 keeps finding new symptoms.
This track kills the nightmare by fixing the three root causes with surgical, contained, testable changes that the 4 upcoming tracks need as a precondition.
Current State Audit (as of 2026-06-09)
Already Implemented (DO NOT re-implement)
- ✅ live_gui fixture exists at tests/conftest.py:282 (session-scoped)
- ✅ Fixture kills subprocess on teardown (tests/conftest.py:516-547)
- ✅ /api/gui_health endpoint surfaces degraded state (commit 1c565da7)
- ✅ Pre-flight get_gui_health() check in test_full_live_workflow (commit 51ecace4)
- ✅ try/except around immapp.run (commit 1c565da7)
- ✅ _UI_FLAG_DEFAULTS allowlist for __getattr__ (commit bcdc26d0)
- ✅ _ini_capture_ready defer-not-catch flag for imgui.save_ini_settings_to_memory (commit d7487af4)
- ✅ _capture_workspace_profile indent fix (sub-track 1 of live_gui_test_hardening_v2, commit 26e0ced4)
- ✅ ini_content str/bytes contract test (tests/test_workspace_profile_serialization.py)
- ✅ LogPruner busy-loop backoff (commit ac08ee87)
- ✅ RAG dim-mismatch wipe (commit 64bc04a6)
- ✅ RAG _validate_collection_dim recursion fix (commit 644d88ab)
- ✅ RAG index_file CWD fallback (commit eb8357ec, uncommitted as of report; needs to be committed as defensive fix)
- ✅ sentence-transformers available in dev env via [local-rag] extra (commit a341d7a7)
- ✅ _sync_rag_engine surfaces embedding_provider init failure (commit e62266e8)
- ✅ test_required_test_dependencies.py enforces test-time deps (commit b801b11c)
- ✅ isolate_workspace, reset_paths, reset_ai_client, vlogger autouse fixtures
- ✅ audit_main_thread_imports.py and audit_weak_types.py static CI gates
- ✅ check_test_toml_paths.py audit script (CI gate for real-TOML references)
- ✅ Batch tier-1 + tier-2 + tier-3 + tier-H + tier-P structure (scripts/run_tests_batched.py)
Gaps to Fill (This Track's Scope)
Gap 1: live_gui subprocess scope + per-test dirty-state guard
- What exists: Session-scoped live_gui fixture. Subprocess state survives across 49+ tests.
- What's missing: When a test dies (IM_ASSERT, error result, etc.) the subprocess is degraded; subsequent tests in different files get dirty state. The pre-flight get_gui_health() check is file-local, not test-local, and it only checks health; it does not recover.
- Real symptom: test_rag_phase4_final_verify passes in isolation, fails in batch. test_gui2_set_value_hook_works returns '' instead of the queued value. test_rag_phase4_stress shows non-deterministic indexing.
Gap 2: Filesystem hygiene for live_gui_workspace
- What exists: tests/conftest.py:412 hardcodes Path("tests/artifacts/live_gui_workspace"). 6 test files re-derive the same path independently.
- What's missing: The path is relative to CWD. When the test runner or prior tests shift CWD, all downstream path joins break. RAGEngine.index_file joins base_dir + file_path; when base_dir is relative and CWD has drifted, the file doesn't exist and indexing silently no-ops.
- Real symptom: RAG test in batch finds 0 documents in collection. chroma_test_final_verify count=0. chroma_db collection count=0. chroma_test_stress count=0. Only chroma_manual_slop (the user's project, NOT a test) has 328 docs from a separate session.
- Files affected:
  - tests/conftest.py:412 (HARDCODED)
  - tests/test_rag_phase4_final_verify.py:20
  - tests/test_rag_phase4_stress.py:21
  - tests/test_saved_presets_sim.py:14, 121
  - tests/test_tool_presets_sim.py:13
  - tests/test_visual_sim_gui_ux.py:79
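The failure mode in Gap 2 can be reproduced in a few lines. This is a minimal sketch, not the project's RAGEngine: index_file_sketch is a hypothetical stand-in that joins base_dir and file_path the way the gap describes, and the temporary directories stand in for the repo root and for wherever a prior test left the CWD.

```python
import os
import tempfile
from pathlib import Path


def index_file_sketch(base_dir: str, file_path: str) -> bool:
    """Hypothetical stand-in for an index_file that joins base_dir + file_path.

    Returns True if the joined path is a real file (it would be indexed) and
    False if the join points at nothing and the call would silently no-op.
    """
    return os.path.isfile(os.path.join(base_dir, file_path))


def demo_cwd_drift() -> dict:
    original_cwd = os.getcwd()
    results = {}
    with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as elsewhere:
        try:
            os.chdir(root)
            workspace = Path("artifacts") / "workspace"  # relative, like the hardcoded path
            workspace.mkdir(parents=True)
            (workspace / "doc.txt").write_text("hello")
            absolute_workspace = workspace.resolve()     # captured while CWD is still correct

            results["relative_before_drift"] = index_file_sketch(str(workspace), "doc.txt")
            os.chdir(elsewhere)                          # a prior test shifts CWD
            results["relative_after_drift"] = index_file_sketch(str(workspace), "doc.txt")
            results["absolute_after_drift"] = index_file_sketch(str(absolute_workspace), "doc.txt")
        finally:
            os.chdir(original_cwd)                       # restore CWD before the dirs are removed
    return results
```

Under these assumptions the relative base_dir works only until the CWD changes, while the absolute path keeps working afterwards, which is why an absolute workspace path is the fix rather than a CWD fallback alone.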
Gap 3: _sync_rag_engine io_pool race
- What exists: src/app_controller.py _sync_rag_engine submits a sync task to _io_pool for each set_value that mutates rag_config. Multiple setters in quick succession → multiple parallel sync tasks → non-deterministic indexing.
- What's missing: A coalescing/debounce pattern that serializes sync attempts within a short window (e.g., 100ms).
- Real symptom: Test fires 5 setters (rag_collection_name, files, rag_enabled, rag_source, rag_emb_provider) in succession. Each submits a sync. Whichever sync finishes last installs its engine, regardless of submission order, so indexing can run against an engine built from a stale config. The test then asserts on the wrong engine's output.
Gap 4: set_value hook test failure (pre-existing, separate code path)
- What exists: test_gui2_set_value_hook_works line 41: set_value returns 'queued' but get_value('ai_input') returns '' after 1.5s.
- What's missing: A fix for a setattr routing issue in gui_2.py, similar to the earlier _UI_FLAG_DEFAULTS fix. The test's input doesn't actually reach the controller.
- Real symptom: Test fails in batch; same class of bug as the _UI_FLAG_DEFAULTS allowlist bug (commit bcdc26d0).
Gap 5: Tests assert against dirty subprocess state from prior tests
- What exists: Test isolation is implicit (assumes clean state from prior fixture). When a prior test's set_value calls pollute the controller, subsequent tests fail in ways unrelated to their code.
- What's missing: A _reset_controller_state hook that the live_gui fixture exposes, so each test can opt in to a clean baseline.
Goals
- Goal A: Per-test subprocess resilience. Make the live_gui fixture recover from a degraded subprocess BEFORE each test (not just before each file). When the subprocess dies mid-test, the next test gets a fresh one.
- Goal B: Path hygiene for the live_gui workspace. Refactor tests/conftest.py:live_gui to use tmp_path_factory.mktemp("live_gui_workspace") and expose the path as a separate fixture. Update all dependent test files to consume the fixture instead of hardcoding the path.
- Goal C: Eliminate _sync_rag_engine race. Add a coalescing/debounce pattern so 5 setters in 100ms produce 1 sync, not 5 parallel syncs.
- Goal D: Fix set_value hook routing. Find the __setattr__ bug that causes set_value('ai_input', ...) to not actually mutate the controller's ai_input state, and fix it the same way _UI_FLAG_DEFAULTS was fixed.
- Goal E: Test files assert against fresh state. Add a _reset_controller_state fixture that any test can opt into via autouse-on-marker (@pytest.mark.clean_baseline).
- Goal F: Verify all 4 upcoming tracks have a clean test bed. Run the full tier-1 + tier-2 + tier-3 batch and document which tests pass in batch vs. isolation. The 4 upcoming tracks (qwen_llama_grok, data_oriented_error_handling, data_structure_strengthening, mcp_architecture_refactor) start with a known green baseline.
Non-Goals (Out of Scope)
- ❌ Refactoring the live_gui fixture to per-file scope (Solution A in batch_resilience_plan_20260608.md). Solution D (autouse health check + respawn) is the surgical alternative; per-file is too coarse.
- ❌ Refactoring src/rag_engine.py to a chunk-based data structure (that's the chunkification_optimization_20260608_PLACEHOLDER track).
- ❌ Migrating live_gui tests to mock-based tests (preserves the integration value).
- ❌ Adding CI infrastructure (this repo has no CI; manual batch runs are the verification).
- ❌ Fixing the 7 mock_app tests in test_z_negative_flows.py (separate code path; deferred).
- ❌ Fixing the 5 MMA pipeline tests that don't reach "tracks" state (separate code path; deferred).
- ❌ Fixing the auto_switch_sim test (separate code path; deferred).
- ❌ Doing the code_path_audit_20260607 work (post-4-tracks; the audit is the post-condition).
Functional Requirements
FR1. Per-test subprocess health check + respawn
Where: tests/conftest.py:282 (the live_gui fixture)
What: Add an autouse fixture that runs AFTER live_gui and BEFORE each test that uses it. The fixture:
- Calls client.get_gui_health() with a 1s timeout.
- If health is "degraded" OR the response is None OR the call raises, calls _respawn_subprocess().
- After respawn (or if health was already OK), verifies the subprocess is alive via the existing kill_process_tree machinery.
API:
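import pytest

@pytest.fixture(autouse=True)
def _check_live_gui_health(request):
 # Resolve live_gui lazily. Taking it as a parameter of an autouse fixture
 # would pull the subprocess into every test and make the check below always true.
 if "live_gui" in request.fixturenames:
  handle, _ = request.getfixturevalue("live_gui")
  handle.ensure_alive() # does the health check + respawn
 yield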
Tests required:
- test_live_gui_respawn_after_kill: kill the subprocess via the handle, run a no-op test that uses live_gui, assert the subprocess is alive at test end.
- test_live_gui_health_check_fast_path: when the subprocess is alive, the health check is <100ms.
- test_live_gui_no_respawn_on_clean: when the subprocess is alive AND get_gui_health() returns OK, no respawn happens (verify via a respawn_count counter on the handle).
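The decision logic behind handle.ensure_alive() can be sketched independently of the real subprocess. Everything here is an assumption made for illustration: LiveGuiHandleSketch is a hypothetical class, the health value is modelled as a plain string, and get_health and respawn are injected callables standing in for client.get_gui_health() and _respawn_subprocess().

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LiveGuiHandleSketch:
    """Hypothetical handle; the real one would wrap the subprocess and the Hook API client."""

    get_health: Callable[[], Optional[str]]  # stand-in for client.get_gui_health()
    respawn: Callable[[], None]              # stand-in for _respawn_subprocess()
    respawn_count: int = 0

    def ensure_alive(self) -> bool:
        """Return True if a respawn was needed, False on the fast path."""
        try:
            health = self.get_health()
        except Exception:
            health = None                    # a failed call is treated like a missing response
        if health is None or health == "degraded":
            self.respawn()
            self.respawn_count += 1
            return True
        return False
```

Keeping respawn_count on the handle is what lets test_live_gui_no_respawn_on_clean assert that the fast path did not trigger a respawn.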
FR2. Expose live_gui_workspace as a separate fixture
Where: tests/conftest.py:282 (the live_gui fixture), plus 6 test files
What:
- Change live_gui to create the workspace via tmp_path_factory.mktemp("live_gui_workspace") instead of Path("tests/artifacts/live_gui_workspace").
- Add a new fixture live_gui_workspace that yields the absolute path to the workspace.
- The live_gui fixture uses chdir (or sets the subprocess CWD) to the absolute path; the subprocess inherits the correct CWD.
- Update 6 test files to accept live_gui_workspace as a fixture parameter and use the absolute path instead of the hardcoded one.
Tests required:
- test_live_gui_workspace_is_absolute: assert the workspace path is absolute.
- test_live_gui_workspace_unique_per_session: assert two consecutive sessions get different workspace dirs (per-session mktemp returns unique dirs).
- test_live_gui_workspace_passed_to_test: request live_gui_workspace as a fixture parameter in a test, assert the test can create files in it.
Files to update:
- tests/conftest.py:412: replace Path("tests/artifacts/live_gui_workspace") with tmp_path_factory.mktemp("live_gui_workspace")
- tests/test_rag_phase4_final_verify.py:20: accept live_gui_workspace fixture
- tests/test_rag_phase4_stress.py:21: accept live_gui_workspace fixture
- tests/test_saved_presets_sim.py:14, 121: accept live_gui_workspace fixture
- tests/test_tool_presets_sim.py:13: accept live_gui_workspace fixture
- tests/test_visual_sim_gui_ux.py:79: accept live_gui_workspace fixture
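The two properties FR2 relies on (a unique absolute workspace, and a subprocess that starts inside it) can be checked without pytest. This is a sketch under stated assumptions: make_workspace and child_cwd are hypothetical helpers, and tempfile.mkdtemp stands in for tmp_path_factory.mktemp("live_gui_workspace") so the example runs on its own.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def make_workspace(base_tmp: Path) -> Path:
    """Create a unique workspace directory and return it as an absolute path.

    Stand-in for tmp_path_factory.mktemp("live_gui_workspace"); both hand back
    a fresh directory on every call.
    """
    return Path(tempfile.mkdtemp(prefix="live_gui_workspace", dir=base_tmp)).resolve()


def child_cwd(workspace: Path) -> Path:
    """Start a child process with cwd=workspace and report the CWD it sees."""
    out = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd=workspace,          # set the child's CWD explicitly; the parent's CWD is untouched
        capture_output=True,
        text=True,
        check=True,
    )
    return Path(out.stdout.strip()).resolve()
```

Passing cwd= when launching the subprocess avoids calling chdir in the test process at all, so the parent's CWD cannot drift as a side effect of the fixture.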
FR3. Coalesce _sync_rag_engine calls
Where: src/app_controller.py:_sync_rag_engine (or the setter that triggers it)
What: Replace the immediate-submit pattern with a debounce/coalesce pattern. Multiple setters within a 100ms window produce ONE sync, run on the next idle moment.
Approach: Add a _rag_sync_token: Optional[int] and a _rag_sync_dirty: bool flag. When a setter mutates rag_config, increment the token and set dirty. A background "sync dispatcher" task (or a deferred submit) reads the token, builds the engine once, sets the engine, and clears the flag. If a new setter comes in while a sync is running, increment the token, set dirty, the running sync sees the new token and re-runs once.
Tests required:
- test_sync_rag_engine_coalesces_five_setters: fire 5 setters in 50ms, assert only 1 RAGEngine() is constructed.
- test_sync_rag_engine_rerun_on_token_change: while a sync is running, fire a setter; assert the sync sees the new token and re-runs once.
- test_sync_rag_engine_idempotent_no_changes: if no setters fire, no sync runs.
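The token + dirty-flag approach can be sketched as a small class. This is an illustration under assumptions, not the controller's code: RagSyncCoalescer, mark_dirty, and flush are hypothetical names, build_engine stands in for constructing a RAGEngine, and the 100ms window is modelled as an explicit flush() call made on the next idle moment rather than as a timer.

```python
import threading
from typing import Callable, Optional


class RagSyncCoalescer:
    """Sketch of the token + dirty-flag pattern; all names are illustrative."""

    def __init__(self, build_engine: Callable[[], object]) -> None:
        self._build_engine = build_engine
        self._lock = threading.Lock()
        self._token = 0          # bumped by every setter that mutates rag_config
        self._dirty = False      # True while there are changes no sync has covered
        self._running = False    # True while a flush is building an engine
        self.engine: Optional[object] = None

    def mark_dirty(self) -> None:
        """Called by each setter instead of submitting a sync task directly."""
        with self._lock:
            self._token += 1
            self._dirty = True

    def flush(self) -> int:
        """Run on the next idle moment; returns how many builds this call performed."""
        with self._lock:
            if self._running or not self._dirty:
                return 0         # nothing changed, or a running flush will pick it up
            self._running = True
        builds = 0
        try:
            while True:
                with self._lock:
                    seen = self._token
                    self._dirty = False
                engine = self._build_engine()   # slow work happens outside the lock
                builds += 1
                with self._lock:
                    if self._token == seen:     # no setter fired during the build
                        self.engine = engine
                        self._running = False
                        return builds
                # a setter fired mid-build: drop the stale engine and build again
        except BaseException:
            with self._lock:
                self._running = False
                self._dirty = True              # change not applied; retry on a later flush
            raise
```

With this shape, five mark_dirty() calls followed by one flush() build a single engine, a flush() with no pending change builds nothing, and a setter that fires during a build causes one more build inside the same flush.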
FR4. Fix set_value hook routing for ai_input
Where: src/gui_2.py:__setattr__ (or src/app_controller.py:_handle_set_value)
What: Investigate the __setattr__ / __setstate__ chain. The test (tests/test_gui2_set_value_hook_works) calls client.set_value('ai_input', 'hello'), which posts to /api/gui/set_value, which calls controller.<some_method>. The method either doesn't actually mutate ai_input or routes the value to a different attribute (similar to how _UI_FLAG_DEFAULTS was incorrectly returning None).
Likely root cause: Either:
- The __setattr__ allowlist only includes certain ui_ attrs, and ai_input is not on it, so the assignment is silently dropped.
- The /api/gui/set_value endpoint has a field != 'ai_input' branch that doesn't call the setter.
Tests required:
- test_set_value_hook_ai_input: assert that after set_value('ai_input', 'hello') and a 0.5s wait, get_value('ai_input') returns 'hello'.
- test_set_value_hook_temperature: same for temperature.
- test_set_value_hook_persists: same for model_name.
Diagnostic test (write first): A test that introspects the controller's __dict__ and the API hook's parameter-to-handler mapping to find the missing branch.
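The first suspected root cause (an allowlist that drops the assignment without raising) can be detected with a set-then-read-back probe. The sketch below is hypothetical throughout: ControllerSketch only imitates the suspected behaviour, _SETTABLE is an invented allowlist, and find_dropped_fields is an illustrative helper, not an existing project function.

```python
from typing import Any, Dict, List


class ControllerSketch:
    """Hypothetical controller whose __setattr__ silently drops names off the allowlist."""

    _SETTABLE = frozenset({"temperature", "model_name"})  # ai_input deliberately missing

    def __init__(self) -> None:
        # Bypass the allowlist while seeding defaults.
        object.__setattr__(self, "ai_input", "")
        object.__setattr__(self, "temperature", 0.0)
        object.__setattr__(self, "model_name", "")

    def __setattr__(self, name: str, value: Any) -> None:
        if name in self._SETTABLE:
            object.__setattr__(self, name, value)
        # Otherwise the value is dropped without an error: the suspected failure mode.


def find_dropped_fields(controller: Any, probes: Dict[str, Any]) -> List[str]:
    """Set each probe value, read it back, and return the fields that did not round-trip."""
    dropped = []
    for field, value in probes.items():
        setattr(controller, field, value)
        if getattr(controller, field, None) != value:
            dropped.append(field)
    return dropped
```

Running the same probe against the real controller, with one entry per field the Hook API claims to support, would show whether the drop happens at the attribute level or earlier, in the endpoint's routing.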
FR5. Optional clean-baseline marker
Where: tests/conftest.py (new fixture), test files that want it
What: Add a @pytest.mark.clean_baseline marker. An autouse fixture detects the marker and calls a _reset_controller_state method on the controller before the test starts. The reset clears: ai_input, ai_status, ai_response, current_provider, current_model, rag_config, files, mma_streams, mma_epic_input, mma_proposed_tracks, plus any field set by a prior test.
API:
import pytest

@pytest.fixture(autouse=True)
def _clean_baseline(request):
 # Resolve live_gui lazily so unmarked tests never pull in the subprocess.
 if request.node.get_closest_marker("clean_baseline"):
  handle, _ = request.getfixturevalue("live_gui")
  handle.client.reset_session() # existing endpoint, plus extended reset
 yield
Tests required:
- test_clean_baseline_resets_ai_input: set ai_input='polluted', mark test with clean_baseline, assert ai_input is '' at test start.
- test_clean_baseline_resets_rag_config: same for rag_config.
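One way to cover "plus any field set by a prior test" without maintaining a field list is to snapshot the controller once and restore from the snapshot. This is an in-process sketch under assumptions: BaselineSketch is a hypothetical helper, a SimpleNamespace stands in for the controller, and the real reset would run inside the subprocess behind the reset_session endpoint.

```python
import copy
from types import SimpleNamespace
from typing import Any, Dict


class BaselineSketch:
    """Hypothetical helper: snapshot a controller's fields once, restore them on demand."""

    def __init__(self, controller: Any) -> None:
        # Deep-copy so later in-place mutation (e.g. appending to files) cannot leak in.
        self._snapshot: Dict[str, Any] = copy.deepcopy(vars(controller))

    def reset(self, controller: Any) -> None:
        state = vars(controller)
        for name in list(state):
            if name not in self._snapshot:
                del state[name]              # field introduced by a prior test
        state.update(copy.deepcopy(self._snapshot))


controller = SimpleNamespace(ai_input="", files=[])  # stand-in controller
baseline = BaselineSketch(controller)

controller.ai_input = "polluted"     # a prior test overwrites a field
controller.files.append("a.py")      # ...mutates one in place
controller.scratch = 1               # ...and adds a new one

baseline.reset(controller)
```

After reset, ai_input is empty again, files is an empty list, and scratch no longer exists on the stand-in.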
FR6. Verify the 4 upcoming tracks have a clean test bed
Where: scripts/run_tests_batched.py (no changes); verification in this track's final phase
What: Run the full tier-1 + tier-2 + tier-3 batch and document which tests pass. Produce a "test bed health report" as a markdown file in docs/reports/test_bed_health_20260609.md. The report lists:
- Tier-1 unit tests: all pass (already verified in rag_work_final_20260609_pm.md)
- Tier-2 mock_app tests: all pass
- Tier-3 live_gui tests: pass/fail per file, with the failure mode
- A "before" / "after" diff so the user can see the impact
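The before/after diff can be produced mechanically from two result mappings. A minimal sketch, assuming each batch run has already been reduced to a dict of test file to outcome; render_health_diff is a hypothetical helper and the outcome strings are placeholders, not real results.

```python
from typing import Dict


def render_health_diff(before: Dict[str, str], after: Dict[str, str]) -> str:
    """Render a before/after markdown table; keys are test files, values are outcomes."""
    lines = ["| Test file | Before | After |", "|---|---|---|"]
    for name in sorted(set(before) | set(after)):
        # A file missing from one run is shown as n/a rather than dropped.
        lines.append(f"| {name} | {before.get(name, 'n/a')} | {after.get(name, 'n/a')} |")
    return "\n".join(lines)
```

Sorting by file name keeps the table stable between runs, so successive reports can be compared with a plain text diff.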
Non-Functional Requirements
- NFR1: Per-test overhead < 200ms. The autouse _check_live_gui_health fixture must add <200ms to each test that uses live_gui. The 49 live_gui tests × 200ms = 9.8s additional batch time. Acceptable.
- NFR2: No regressions in tier-1 / tier-2. All unit tests and mock_app tests must continue to pass. The fixture change is additive, not destructive.
- NFR3: Backward compat for tests that don't opt in. Tests that don't use live_gui are unaffected. Tests that use live_gui but don't opt into clean_baseline continue to work (they just don't get a reset).
- NFR4: No hardcoded paths to C:/projects/manual_slop or ./tests/artifacts/ in production code. The track's filesystem-hygiene fix is enforced by the existing scripts/check_test_toml_paths.py audit (extended to also catch Path("tests/artifacts/") and Path("C:/projects/") in test files).
- NFR5: 1-space indentation. All Python code in this track uses 1-space indentation per conductor/product-guidelines.md.
- NFR6: CRLF line endings on Windows. All Python files in this track use CRLF.
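The NFR4 extension amounts to a line scan for Path(...) calls that start with one of the two forbidden prefixes. The sketch below is an assumption about how such a check could look, not the contents of check_test_toml_paths.py; find_hardcoded_paths and the regular expression are illustrative.

```python
import re
from typing import List, Tuple

# Path("tests/artifacts/...") or Path("C:/projects/..."), with either quote style.
_FORBIDDEN = re.compile(r"""Path\(\s*["'](tests/artifacts/|C:/projects/)""")


def find_hardcoded_paths(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) for every line that builds a forbidden Path."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if _FORBIDDEN.search(line):
            hits.append((lineno, line.strip()))
    return hits
```

A line-based regex is deliberately simple: it will not see paths assembled from variables or string concatenation, so it serves as a gate against the known pattern rather than as a full static analysis.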
Architecture Reference
This track touches the following subsystems (see linked deep-dive guides):
- Test infrastructure: tests/conftest.py, scripts/run_tests_batched.py. See docs/guide_testing.md §"7 conftest fixtures" and §"Puppeteer pattern".
- AppController state delegation: src/app_controller.py (166KB). See docs/guide_app_controller.md §"_predefined_callbacks / _gettable_fields Hook API registries" and docs/guide_state_lifecycle.md §"State Delegation (getattr/setattr)".
- RAG engine: src/rag_engine.py. See docs/guide_rag.md §"RAGEngine lifecycle" and §"Sync to controller".
- Hook API: src/api_hooks.py + src/api_hook_client.py. See docs/guide_api_hooks.md §"/api/gui/set_value" and §"Remote Confirmation Protocol".
- io_pool: src/app_controller.py:_io_pool. See docs/guide_architecture.md §"Thread domains".
Key design constraints inherited
- Defer-not-catch pattern: imgui.* calls before ImGui is ready crash at the C level (0xc0000005). The _check_live_gui_health fixture must NOT touch ImGui directly. It uses the existing Hook API (/api/gui_health, /api/status) which runs in the hook server thread, not the render thread.
- Session-scoped fixture: live_gui is session-scoped by design. Per-file or per-test scoping would break cross-test state (e.g., test_full_live_workflow expects a fresh live_gui, but test_rag_phase4_stress depends on the same subprocess the prior 4 sims used). The autouse respawn is the surgical solution.
- tmp_path_factory scope: tmp_path_factory is a session-scoped fixture (per the pytest docs); the per-test tmp_path is a different, function-scoped fixture. The live_gui_workspace fixture must use tmp_path_factory to be consistent with the session-scoped live_gui.
Key prior decisions to respect
- The _UI_FLAG_DEFAULTS allowlist was a HARD-CODED set. The new set_value hook fix should follow the same allowlist pattern (consistency with the existing fix) OR use a class-level attribute that derives from __init__ annotations (the better fix, but the user has not asked for the better fix; this track stays surgical).
- The existing run_tests_batched.py tier structure (tier-1 unit, tier-2 mock_app, tier-3 live_gui, tier-H headless, tier-P perf) is NOT to be restructured. The track works WITH the existing tier structure.
- The audit_main_thread_imports.py and audit_weak_types.py static CI gates are the project's enforcement mechanism. The new Path("tests/artifacts/") and Path("C:/projects/") patterns are added to check_test_toml_paths.py (extended) as a third gate.
Out of Scope
The following are explicitly NOT part of this track. They are mentioned so the user knows they are deferred, not forgotten:
- Per-file live_gui fixture scope (Solution A from batch_resilience_plan_20260608.md): Not needed if the per-test autouse respawn works. May revisit if the per-test respawn has too much overhead.
- Refactoring live_gui fixture to a class-based handle with respawn (Solution B): Same reasoning; only do if per-test respawn is insufficient.
- MMA pipeline tests that don't reach "tracks" state: 3 tests fail in this pattern (test_mma_concurrent_tracks_execution, test_mma_step_mode_approval_flow, test_mma_complete_lifecycle). These are MMA-engine-state-transition bugs, not test-isolation bugs. Out of scope.
- Negative-flows tests (test_z_negative_flows.py): 3 tests fail in this pattern. They exercise the mock provider's error path. Pre-existing, separate code path. Out of scope.
- test_auto_switch_sim: Workspace auto-switch logic not applying Tier 3 profile. Pre-existing, separate code path. Out of scope.
- test_prior_session_no_pop_imbalance: Already addressed in live_gui_test_hardening_v2 (commit 26e0ced4). Verify it still passes.
- code_path_audit_20260607: Post-4-tracks audit. This track unblocks the 4 tracks; the audit runs after.
- chunkification_optimization_20260608_PLACEHOLDER: The comms.log chunkification. Out of scope; the user has not approved it.
- manual_ux_validation_20260608_PLACEHOLDER: The ASCII-sketch workflow. Out of scope; the user has not approved it.
- CI infrastructure: No CI in this repo. Manual batch runs are the verification.
Verification Criteria
This track is "done" when ALL of the following are true:
- ✅ All tier-1 unit tests pass in batch (no regression).
- ✅ All tier-2 mock_app tests pass in batch (no regression).
- ✅ The 6 test files that hardcoded Path("tests/artifacts/live_gui_workspace") now use the live_gui_workspace fixture.
- ✅ test_rag_phase4_final_verify.py::test_phase4_final_verify passes in BATCH (after 4 sims); this is the primary symptom the user wanted fixed.
- ✅ test_rag_phase4_stress.py passes in batch OR has a documented reason for the residual flakiness (acceptable per rag_work_final_20260609_pm.md's "out of scope" decision IF the io_pool race fix in FR3 lands).
- ✅ test_gui2_set_value_hook_works passes in batch.
- ✅ The autouse _check_live_gui_health fixture is in place; a new test (test_live_gui_respawn_after_kill) verifies it.
- ✅ The _sync_rag_engine coalescing fix is in place; a new test (test_sync_rag_engine_coalesces_five_setters) verifies it.
- ✅ A docs/reports/test_bed_health_20260609.md report is committed, listing pass/fail per test file with the failure mode for any residual failures.
- ✅ scripts/check_test_toml_paths.py is extended to flag Path("tests/artifacts/") and Path("C:/projects/") in test files; the audit passes.
Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Per-test respawn adds too much overhead (>200ms × 49 tests = 10s) | Medium | Low | Verify with the NFR1 measurement; if exceeded, fall back to per-batch respawn |
| Per-test respawn breaks cross-test state dependencies | Medium | High | Add a --no-respawn pytest flag for tests that need cross-test state; audit the 49 live_gui tests for state dependencies before Phase 1 |
| tmp_path_factory.mktemp changes the workspace path, breaking the on-disk chroma DB persistence assumption | High | Low | Clear .slop_cache/ dirs at session start; OR add a live_gui_workspace_persist opt-in |
| _sync_rag_engine coalescing breaks the existing RAG test that DEPENDS on multiple parallel syncs (unlikely) | Low | Medium | Write the FR3 tests to verify both "5 setters → 1 sync" AND "single setter → single sync" still work |
| set_value hook fix changes behavior for existing tests that assert on the OLD (broken) behavior | Low | High | Run the full tier-3 batch in Phase 3 and verify no regressions |
| The tmp_path_factory.mktemp refactor corrupts tests/conftest.py (the previous attempt at this refactor DID corrupt it; commit was reverted per rag_test_batch_failure_status_20260609_pm3.md) | High | High | Use git stash before each edit; if edit fails, git stash pop and try again with manual-slop_set_file_slice (which is the recommended surgical tool per conductor/edit_workflow.md) |
Phases (summary)
This spec is the entry point. The plan (plan.md) breaks these into TDD-ready tasks.
| Phase | Scope | Effort |
|---|---|---|
| Phase 1 | Audit: enumerate all live_gui cross-test state dependencies, document baseline failure modes | 1 day |
| Phase 2 | FR1: Per-test subprocess health check + respawn (autouse fixture) | 1 day |
| Phase 3 | FR2: Expose live_gui_workspace as a separate fixture, update 6 test files | 1 day |
| Phase 4 | FR3: Coalesce _sync_rag_engine calls (token + dirty flag pattern) | 1 day |
| Phase 5 | FR4: Fix set_value hook routing for ai_input | 1 day |
| Phase 6 | FR5: Optional clean_baseline marker | 0.5 day |
| Phase 7 | FR6: Run full batch, produce test_bed_health report | 0.5 day |
| Phase 8 | Docs: update docs/guide_testing.md + docs/guide_state_lifecycle.md | 0.5 day |
Total: 6.5 days (fits within 1 sprint).
See Also
- Foundation: docs/reports/test_infra_hardening_foundation_20260608.md — original 5-phase plan; this spec supersedes with sharper scope.
- Batch resilience: docs/reports/batch_resilience_plan_20260608.md — 4 solutions; this spec adopts Solution D (autouse respawn) as primary.
- RAG failure status: docs/reports/rag_test_batch_failure_status_20260609_pm3.md — the filesystem hygiene findings that drive FR2.
- RAG final report: docs/reports/rag_work_final_20260609_pm.md — the io_pool race that drives FR3.
- Process anti-patterns: conductor/workflow.md §"Process Anti-Patterns (Added 2026-06-09)" — the Deduction Loop and Report-Instead-of-Fix patterns this track is designed to prevent.
- Edit workflow: conductor/edit_workflow.md: the surgical tool guidance; the conftest refactor MUST use manual-slop_set_file_slice after the previous attempt was reverted due to corruption.
- Architecture deep-dive: docs/guide_testing.md §"7 conftest fixtures" + docs/guide_state_lifecycle.md §"State Delegation".
- 4 upcoming tracks:
- qwen_llama_grok_integration_20260606 — spec ✓
- data_oriented_error_handling_20260606 — plan ✓
- data_structure_strengthening_20260606 — plan pending
- mcp_architecture_refactor_20260606 — plan pending
Approval Required
This spec requires user approval before the plan is written. Per the conductor workflow:
The spec is the agent's design intent — it explains WHY, not just WHAT. A plan for an unapproved spec is wasted effort.
The user has asked for a track to "kill the test regression nightmare." This spec defines what "kill" means: 5 surgical fixes (FR1-FR5) + a verification report (FR6) that produces a clean test bed for the 4 upcoming tracks. If the user wants more aggressive scope (e.g., refactoring live_gui to per-file scope), revise the spec before approving.