conductor(track): init mma_quarantine_rag_test_decoupling_20260701 (spec + metadata + state + tracks.md row)

Track artifacts for the MMA quarantine + RAG test decoupling effort. Design doc lives at docs/superpowers/specs/ (historical record preserved). Plan.md pending user spec approval.
2026-07-01 18:54:05 -04:00
parent 7c046ee7b4
commit e5f37e7443
4 changed files with 201 additions and 0 deletions
@@ -16,6 +16,7 @@ Tracks that are unblocked and ready to start. Ordered by **dependency** (blocked

 | # | Priority | Track | Status | Blocked By |
 |---|---|---|---|---|
+| 36 | A (sunset) | [MMA Quarantine + RAG Test Decoupling](#track-mma-quarantine--rag-test-decoupling) | spec, plan pending; user-directed sunset of MMA automation engine + decoupling of RAG tests from live_gui/chromadb; design doc at `docs/superpowers/specs/2026-07-01-mma-quarantine-rag-test-decoupling-design.md` | (none - independent) |
 | 2 | A | [Qwen, Llama & Grok Vendor Integration + Capability Matrix](#track-qwen-llama-grok-vendor-integration--capability-matrix) | spec ✓, plan ✓, 50/79 tasks done; **Phase 6 in progress (docs); NOT archiving — has follow-up track** | **test_infrastructure_hardening_20260609 (merged)** |
 | 3 | A | [Data-Oriented Error Handling (Fleury Pattern)](#track-data-oriented-error-handling-fleury-pattern) | spec ✓, plan ✓, ready to start | startup_speedup, test_batching_refactor, **test_infrastructure_hardening_20260609 (merged)**, qwen_llama_grok |
 | 4 | A | [MCP Architecture Refactor (Sub-MCP Extraction)](#track-mcp-architecture-refactor-sub-mcp-extraction) | spec ✓, plan pending | test_infrastructure_hardening_20260609 (merged), data_oriented_error_handling, data_structure_strengthening |
@@ -76,6 +77,23 @@ Tracks that are unblocked and ready to start. Ordered by **dependency** (blocked
 | 35 | A (refactor) | [Metadata Promotion: dict[str, Any] → per-aggregate @dataclass](#track-metadata-promotion-2026-06-24) | spec ✓, plan ✓, metadata ✓, state ✓, **SHIPPED 2026-06-25** by Tier 2 autonomous mode; 13 phases, 32 tasks, 10 atomic commits; **Phase 0** added 12 NEW per-aggregate dataclasses (11 in src/type_aliases.py + RAGChunk in src/rag_engine.py; +158 lines); 11 new test files with 70+ regression tests (all PASS); updated test_type_aliases.py (6 tests); regenerated type_registry (22→23 files). **Phases 1-10** were NO-OPS per audit: most consumer sites operate on dicts at I/O boundaries (session log entries from JSONL, multimodal content with `is_image`/`base64_data` keys, MCP wire protocol, project config from `manual_slop.toml`), correctly classified as collapsed-codepath per FR2. **Phase 11** audited 253 remaining access sites (125 .get() + 128 []); all classified as collapsed-codepath with file-level justification. **VC7 PARTIAL**: effective codepaths UNCHANGED at 4.014e+22 (metric dominated by `2^N` for highest-branch-count functions in app_controller.py and gui_2.py; reducing `.get()` access sites alone does NOT reduce branch count — dispatchers still need `if entry.get(...)` or `if isinstance(entry, X)` checks regardless of dict-vs-dataclass; actual reduction requires TYPED PARAMETERS at function boundaries, out of scope). **Other VCs**: 7/7 audit gates pass --strict; 103 tests pass (70 NEW + 14 updated + 19 openai_schemas); tier 1+2 batched tests not re-verified (Phase 2 baseline still applies). TRACK_COMPLETION at `docs/reports/TRACK_COMPLETION_metadata_promotion_20260624.md` | `code_path_audit_phase_3_provider_state_20260624` (recommended prerequisite, SHIPPED 2026-06-25) | (**NEW 2026-06-24, SHIPPED 2026-06-25**; corrected 2026-06-25 per Tier 1 audit; per-aggregate dataclasses for known sub-aggregates; `Metadata: TypeAlias = dict[str, Any]` preserved unchanged as the catch-all for collapsed codepaths; the 12 NEW dataclasses are AVAILABLE for future code that wants typed access; existing dict-style consumers are correct per FR2; the effective codepaths metric cannot be reduced by adding dataclasses alone — it requires typed parameters at function boundaries; **scope reality check**: spec estimated ~213 access site migrations; actual migrations = 0 (all sites are correctly classified as collapsed-codepath); the real work was adding the 12 dataclasses for future use) |
 | 32 | A (refactor) | [Metadata Nil Sentinel (SSDL campaign child 1)](#track-metadata-nil-sentinel-ssdl-campaign-child-1-2026-06-24) | spec ✓, plan ✓, metadata ✓, state ✓, **SHIPPED 2026-06-24** by Tier 2 autonomous mode; 3 phases, 3 tasks, 3 atomic commits; NIL_METADATA = {} sentinel defined in `src/aggregate.py:50`; `_build_files_section_from_items` migrated to sentinel pattern (file_items = file_items or []; item = item or NIL_METADATA; if path is None: → if not path:); 5/5 behavioral tests PASS; VC1=true, VC2=true, VC3=true, VC4=FAIL (drop was -0.1%; spec's 10% threshold is mathematically near-impossible due to exponential dominance; campaign spec R4 acknowledges this), VC5=true (Tier 1 + Tier 2 both 5/5; Tier 3 has 1 pre-existing flake that passes in isolation), VC6=true; TRACK_COMPLETION at `docs/reports/TRACK_COMPLETION_metadata_nil_sentinel_20260624.md`; **spec discrepancy noted**: spec said "6 nil-check functions" but SSDL detects 74 across codebase (1 in aggregate.py, 27 in aggregate.py + ai_client.py); 1 was cleanly migratable in aggregate.py | `metadata_ssdl_defusing_20260624` (parent campaign) | (**NEW 2026-06-24**; child 1 of 3; establishes the NIL_METADATA fallback primitive for child 2's generational-handle generation-mismatch path; cumulative campaign effect is the value, not single-child heuristic number; **budget gate recommendation**: child 2 and child 3 should be allowed to ship even if their individual budget gates fail) |

+### Track: MMA Quarantine + RAG Test Decoupling
+
+**ID:** `mma_quarantine_rag_test_decoupling_20260701`
+**Priority:** A (sunset)
+**Status:** spec ✓, plan pending
+**Blocked By:** (none — independent)
+**Files:** `conductor/tracks/mma_quarantine_rag_test_decoupling_20260701/`
+**Design doc:** `docs/superpowers/specs/2026-07-01-mma-quarantine-rag-test-decoupling-design.md`
+
+Two surgical interventions driven by the user's directive (2026-07-01) to sunset the MMA automation engine constructively and stop the RAG tests from bleeding on every test-suite run:
+
+1. **MMA quarantine.** Config flag `mma.enabled` (default `false`) in `[ai_settings.toml]` gates the MMA automation engine (`multi_agent_conductor.py` + ~20 `app_controller.py` state/method sites + 12 `gui_2.py` render functions + MMA Dashboard window + approval modals). Shared types (`mma.py`, `dag_engine.py`, `mma_prompts.py`) stay active because non-MMA code depends on them (`thinking_parser.ThinkingSegment`, `project_manager.TrackState`, `models.TrackMetadata`, `conductor_tech_lead.TrackDAG/Ticket`). MMA tests become opt-in via `SLOP_MMA_TESTS=1`. Full removal is a follow-up track if quarantine maintenance hurts. Per `conductor/code_styleguides/feature_flags.md` §2: persistent preference → config flag + GUI checkbox (single `[ ] Enable MMA (deprecated, quarantined)` checkbox in AI Settings is the only MMA UI surface when the flag is off).
+
+2. **RAG test decoupling.** Three-tier classification of RAG tests so the default batch stops touching chromadb file locks, the live_gui subprocess, and CWD drift. The RAG algorithm (`index_file`, `search`, chunking) is unchanged. Tier 1 (unit tests against mock provider) + Tier 2 (controller lifecycle tests with mock engine) run in the default batch in milliseconds. Tier 3 (integration tests against real chromadb/live_gui — the fragile ones behind the 2 ADDENDUM reports from 2026-06-27) becomes opt-in via `SLOP_RAG_INTEGRATION=1`.
+
+**Verification criteria** (8 total): MMA dashboard no-render + engine no-op + no `multi_agent_conductor` runtime import when flag off; MMA re-enabled when flag on; no regression in shared-types consumers; default batch RAG tests fast + no chromadb; integration tests opt-in; `RAGEngine` source unchanged; `audit_main_thread_imports.py` passes; MMA tests skip-not-fail by default.
+
 **Note on numbering:** the legacy file used `0a`, `0b`, `0c`... and `0d`, `0e`, `0f`, `0g` for tracks created 2026-06-06+. This is the **git-blame sort order**, not a logical execution order. The new structure re-orders by dependency.

 ---
@@ -0,0 +1,68 @@
+{
+  "track_id": "mma_quarantine_rag_test_decoupling_20260701",
+  "name": "MMA Quarantine + RAG Test Decoupling",
+  "type": "refactor",
+  "scope": {
+    "new_files": [],
+    "modified_files": [
+      "src/app_controller.py",
+      "src/gui_2.py",
+      "tests/test_rag_engine.py",
+      "tests/test_rag_engine_ready_status_bug.py",
+      "tests/test_rag_gui_presence.py",
+      "tests/test_sync_rag_engine_coalescing.py",
+      "tests/test_reset_session_clears_mma_and_rag.py",
+      "tests/test_rag_phase4_final_verify.py",
+      "tests/test_rag_phase4_stress.py",
+      "tests/test_rag_visual_sim.py",
+      "tests/test_rag_integration.py"
+    ],
+    "deleted_files": []
+  },
+  "blocked_by": [],
+  "blocks": [],
+  "pre_existing_failures_remaining": [],
+  "deferred_to_followup_tracks": [
+    {
+      "title": "Full MMA removal (Option A)",
+      "description": "Migrate ThinkingSegment -> thinking_parser.py, TrackState/TrackMetadata -> project_manager.py or models.py, then delete mma.py + multi_agent_conductor.py + dag_engine.py (if conductor_tech_lead rewritten) + mma_prompts.py + MMA UI + app_controller MMA state. Only if quarantine maintenance becomes painful.",
+      "track_status": "not started"
+    }
+  ],
+  "verification_criteria": [
+    "VC1: mma.enabled = false (default) -> MMA dashboard does not render; engine methods no-op; multi_agent_conductor not imported at runtime",
+    "VC2: mma.enabled = true -> MMA dashboard renders; engine starts; MMA tests run when SLOP_MMA_TESTS=1",
+    "VC3: No regression in non-MMA code (thinking_parser, project_manager, models, conductor_tech_lead shared types intact)",
+    "VC4: Default test batch: Tier 1+2 RAG tests run in milliseconds, no chromadb, no subprocess",
+    "VC5: SLOP_RAG_INTEGRATION=1 -> Tier 3 RAG integration tests run (the previously-fragile tests)",
+    "VC6: RAG feature functional end-to-end when rag.enabled = true (RAGEngine source unchanged)",
+    "VC7: scripts/audit_main_thread_imports.py passes (no new startup-time imports from lazy multi_agent_conductor)",
+    "VC8: MMA tests skip (not fail) by default; skip reason documents quarantine + re-enable instructions"
+  ],
+  "estimated_effort": {
+    "method": "scope (per workflow.md Tier 1 Track Initialization Rules). NO day estimates.",
+    "scope": "2 phases; ~20 app_controller sites + 12 gui_2 render functions + ~30 MMA test files gated + ~8 RAG test files reclassified; src/ changes surgical (flag gates + early returns + lazy import)"
+  },
+  "risk_register": [
+    {
+      "id": "R1",
+      "likelihood": "medium",
+      "description": "Lazy import correctness for multi_agent_conductor (circular import or startup-time import when flag is off). Mitigation: audit_main_thread_imports.py post-implementation."
+    },
+    {
+      "id": "R2",
+      "likelihood": "medium",
+      "description": "Shared-types boundary drift (mma.py/dag_engine.py/mma_prompts.py must stay importable by non-MMA consumers). Mitigation: FR5 classification + regression tests."
+    },
+    {
+      "id": "R3",
+      "likelihood": "low",
+      "description": "Test classifier drift (MMA engine tests gate; shared dag_engine type tests stay). Mitigation: per-file classification documented in plan."
+    },
+    {
+      "id": "R4",
+      "likelihood": "low",
+      "description": "Quarantine maintenance cost (rotting quarantined code). Mitigation: follow-up full-removal track if it rots."
+    }
+  ]
+}
@@ -0,0 +1,88 @@
+# Track Specification: MMA Quarantine + RAG Test Decoupling
+
+## Overview
+
+Two surgical interventions:
+
+1. **MMA quarantine.** Feature-flag the MMA automation engine (`multi_agent_conductor.py` + the `app_controller`/`gui_2` wiring) off by default via a config flag `mma.enabled = false` in `[ai_settings.toml]`. The engine stops running, the dashboard stops rendering, the MMA tests become opt-in. Shared types (`mma.py`, `dag_engine.py`, `mma_prompts.py`) stay active because non-MMA code depends on them. Full removal is a follow-up track if quarantine maintenance becomes painful.
+
+2. **RAG test decoupling.** Reclassify the RAG tests into three tiers so the default test batch stops bleeding on chromadb file locks, live_gui subprocess pollution, and CWD drift. The RAG algorithm is unchanged. The integration tests become opt-in via `SLOP_RAG_INTEGRATION=1`.
+
+**Design reference:** `docs/superpowers/specs/2026-07-01-mma-quarantine-rag-test-decoupling-design.md` (the full design doc; this spec is the track artifact that mirrors it).
+
+## Current State Audit (as of 2026-07-01, commit `7c046ee7`)
+
+### Already Implemented (DO NOT re-implement)
+
+- `src/rag_engine.py` mock provider (`provider == 'mock'` short-circuits chromadb; `collection == "mock"`; `is_empty()` returns `True`; `search()` returns `[]`; `index_file()` no-ops). This is the isolated test surface for Tier 1+2 tests — already exists.
+- `conductor/code_styleguides/feature_flags.md` — the config-flag-vs-file-presence decision tree. §2 mandates config flag + GUI checkbox for persistent preferences (MMA quarantine qualifies).
+- `src/mma.py` — shared types module (`Ticket`, `Track`, `TrackState`, `TrackMetadata`, `WorkerContext`, `ThinkingSegment`). Load-bearing for `thinking_parser.py`, `project_manager.py`, `models.py`, `conductor_tech_lead.py`, `dag_engine.py`. NOT gated by this track.
+- `src/dag_engine.py` — `TrackDAG`, `ExecutionEngine`. Shared with `conductor_tech_lead.py`. NOT gated by this track.
+
+### Gaps to Fill (This Track's Scope)
+
+**MMA quarantine gaps:**
+- No `mma.enabled` config flag exists. `app_controller.py` unconditionally imports `multi_agent_conductor` (line 38) and instantiates `ConductorEngine` in `start_mma` paths.
+- `app_controller.py` MMA state fields (`self.engines`, `self.mma_streams`, `self.mma_step_mode`, `self.mma_tier_usage`, `self.tracks`) are always live; engine methods don't gate on a flag.
+- `gui_2.py` 12 `render_mma_*` functions + `render_task_dag_panel` + the MMA Dashboard window registration (line 1995) + approval modals — all unconditional.
+- MMA tests run in the default batch (no env-gate). ~30 test files.
+
+**RAG test decoupling gaps:**
+- `test_rag_engine.py` tests against real chromadb, not the mock provider.
+- `test_rag_phase4_final_verify.py`, `test_rag_phase4_stress.py`, `test_rag_visual_sim.py`, `test_rag_integration.py` run in the default batch and are the recurring bleed source (10 commits + 2 ADDENDUM reports on 2026-06-27).
+- Controller lifecycle tests (`test_rag_engine_ready_status_bug.py`, `test_sync_rag_engine_coalescing.py`, `test_reset_session_clears_mma_and_rag.py`) exercise real engine state instead of a mock.
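
A Tier 2 lifecycle test can swap the real engine for a stand-in. The sketch below is hypothetical: `StubRAGEngine`, `sync_rag`, and the method signatures are illustrative assumptions, not the actual `RAGEngine` API. Only the behaviors (`is_empty()` returns `True`, `search()` returns `[]`, `index_file()` no-ops) come from the mock-provider description above.

```python
class StubRAGEngine:
    """Hypothetical stand-in mirroring the documented mock-provider behavior."""

    def __init__(self) -> None:
        # Recorded so a lifecycle test can assert on what the controller did.
        self.index_calls: list[str] = []

    def is_empty(self) -> bool:
        return True

    def search(self, query: str) -> list:
        return []

    def index_file(self, path: str) -> None:
        # No-op apart from recording the call; nothing touches chromadb or disk.
        self.index_calls.append(path)


def sync_rag(engine, paths: list[str]) -> str:
    """Hypothetical controller step: index files, then report a status string."""
    for path in paths:
        engine.index_file(path)
    return "ready"


def test_sync_reports_ready_without_chromadb() -> None:
    engine = StubRAGEngine()
    assert sync_rag(engine, ["a.py", "b.py"]) == "ready"
    assert engine.index_calls == ["a.py", "b.py"]
    assert engine.search("anything") == []
```

Because the stub holds no file handles and spawns no subprocess, a test of this shape should be safe to keep in the default batch.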
+
+## Goals
+
+- MMA dashboard does not render and MMA engine does not run when `mma.enabled = false` (the default).
+- Default test batch no longer touches chromadb file locks or the live_gui subprocess for RAG verification.
+- Shared types (`mma.py`, `dag_engine.py`) remain importable by non-MMA consumers — no regression in `thinking_parser`, `project_manager`, `models`, `conductor_tech_lead`.
+- Quarantine is reversible: `mma.enabled = true` re-enables the dashboard + engine + opt-in tests.
+
+## Functional Requirements
+
+### MMA Quarantine
+
+- **FR1:** `mma.enabled` config flag in `[ai_settings.toml]`, default `false`. GUI checkbox `[ ] Enable MMA (deprecated, quarantined)` in AI Settings is the only MMA UI surface when the flag is off.
+- **FR2:** `app_controller.py` — `self.engines`, `self.mma_streams`, `self.mma_step_mode`, `self.mma_tier_usage` stay zero-init when flag is off. Engine-start/approve/abort methods return early. `multi_agent_conductor` imported lazily inside gated methods only.
+- **FR3:** `gui_2.py` — 12 `render_mma_*` functions + `render_task_dag_panel` return early when flag is off. MMA Dashboard window not registered in panel registry when flag is off. Approval modals no-op.
+- **FR4:** MMA tests gated behind `SLOP_MMA_TESTS=1` env var via `@pytest.mark.skipif`. Skip reason documents quarantine status + how to re-enable. Tests of shared `dag_engine` types (not the MMA engine) stay in default batch — classifier: does the test import/instantiate `multi_agent_conductor.ConductorEngine` / `WorkerPool`? If yes → gated. If no → stays.
+- **FR5:** `mma_prompts.py` usages in `ai_client.py`, `conductor_tech_lead.py`, `orchestrator_pm.py` classified during implementation: if MMA-specific → gate; if general → leave.
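
The early-return and lazy-import pattern from FR2 can be sketched as below. `ControllerSketch` and its `engine_module` parameter are hypothetical illustrations, not the real `app_controller.py`; the point is only that the import statement moves inside the gated method, so nothing is loaded while the flag is off.

```python
import importlib
import sys


class ControllerSketch:
    """Hypothetical slice of a controller; not the real app_controller.py."""

    def __init__(self, mma_enabled: bool = False,
                 engine_module: str = "multi_agent_conductor") -> None:
        self.mma_enabled = mma_enabled
        self._engine_module = engine_module    # name only; imported on demand
        self.engines: dict[str, object] = {}   # stays empty while quarantined

    def start_mma(self, track_id: str) -> bool:
        if not self.mma_enabled:
            return False  # early return: no import, no state change
        # Lazy import: the module loads the first time a gated method runs.
        module = importlib.import_module(self._engine_module)
        self.engines[track_id] = module
        return True


controller = ControllerSketch()  # flag defaults to off
started = controller.start_mma("track-1")
print(started, controller.engines, "multi_agent_conductor" in sys.modules)
```

Checking `sys.modules` after a flag-off run is one cheap way to confirm the import never happened, alongside `scripts/audit_main_thread_imports.py`.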
+
+### RAG Test Decoupling
+
+- **FR6:** Tier 1 unit tests (`test_rag_chunk.py`, `test_rag_engine.py`, `test_rag_engine_result.py`, `test_rag_sync_none_error.py`) use mock provider exclusively. No chromadb. Default batch.
+- **FR7:** Tier 2 controller lifecycle tests (`test_rag_engine_ready_status_bug.py`, `test_rag_gui_presence.py`, `test_sync_rag_engine_coalescing.py`, `test_reset_session_clears_mma_and_rag.py` RAG portion) use mock/stub `RAGEngine`. No chromadb. Default batch.
+- **FR8:** Tier 3 integration tests (`test_rag_phase4_final_verify.py`, `test_rag_phase4_stress.py`, `test_rag_visual_sim.py`, `test_rag_integration.py`) gated behind `SLOP_RAG_INTEGRATION=1` env var via `@pytest.mark.skipif`. Skip reason documents: "Integration test requires real chromadb + live_gui subprocess; run with `SLOP_RAG_INTEGRATION=1` to enable."
+- **FR9:** `RAGEngine` source unchanged. No algorithm changes to `index_file`, `search`, chunking.
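
The env-var gate used by FR4 and FR8 could look like the sketch below. The helper `skip_unless_env` and the constant `RAG_SKIP_REASON` are hypothetical names; the `pytest.mark.skipif` call is shown as a comment so the snippet runs without pytest installed.

```python
import os


def skip_unless_env(var: str, environ=None) -> bool:
    """Return True when tests gated on `var` should be skipped."""
    env = os.environ if environ is None else environ
    return env.get(var) != "1"


RAG_SKIP_REASON = (
    "Integration test requires real chromadb + live_gui subprocess; "
    "run with `SLOP_RAG_INTEGRATION=1` to enable."
)

# In a Tier 3 test module (requires pytest):
#   pytestmark = pytest.mark.skipif(
#       skip_unless_env("SLOP_RAG_INTEGRATION"), reason=RAG_SKIP_REASON
#   )

print(skip_unless_env("SLOP_RAG_INTEGRATION", {}))
print(skip_unless_env("SLOP_RAG_INTEGRATION", {"SLOP_RAG_INTEGRATION": "1"}))
```

A module-level `pytestmark` gates every test in the file at once, which suits a per-file classification; the same helper would serve `SLOP_MMA_TESTS` with a different reason string.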
+
+## Non-Functional Requirements
+
+- **NFR1:** No regression in non-MMA code paths (`thinking_parser`, `project_manager`, `models`, `conductor_tech_lead`). Verified via `scripts/audit_main_thread_imports.py` after implementation.
+- **NFR2:** `mma.enabled = true` path still works (dashboard renders, engine starts, opt-in tests run with `SLOP_MMA_TESTS=1`).
+- **NFR3:** Default test batch runtime does not increase. Tier 1+2 RAG tests run in milliseconds.
+- **NFR4:** Per `feature_flags.md` §7 forbidden list: no env-var-only runtime flag (the env var is for test opt-in, separate from the runtime config flag per §6 layering).
+
+## Architecture Reference
+
+- `conductor/code_styleguides/feature_flags.md` §2 (config flag for persistent preferences), §6 (layered flags: config for runtime, env var for test opt-in), §7 (forbidden patterns)
+- `docs/guide_mma.md` — the MMA engine architecture being quarantined
+- `docs/guide_rag.md` — the RAG subsystem architecture (unchanged)
+- `docs/superpowers/specs/2026-07-01-mma-quarantine-rag-test-decoupling-design.md` — the full design doc
+- `conductor/tracks/fix_mma_concurrent_tracks_sim_20260627/` — the recent MMA brittleness precedent
+- Git history 2026-06-27 — the RAG test debugging churn (10 commits, 2 ADDENDUM reports)
+
+## Risks
+
+- **R1 (medium):** Lazy import correctness for `multi_agent_conductor` — must not introduce a circular import or a startup-time import when the flag is off. Mitigation: verify via `scripts/audit_main_thread_imports.py` post-implementation.
+- **R2 (medium):** Shared-types boundary drift — `mma.py` / `dag_engine.py` / `mma_prompts.py` must remain importable by non-MMA consumers. The flag gates the engine, not the types. Mitigation: FR5 classification + regression tests for `thinking_parser` / `project_manager`.
+- **R3 (low):** Test classifier drift — the MMA test classifier ("does it import `ConductorEngine` / `WorkerPool`?") must be applied consistently. A test of shared `dag_engine` types stays; a test of the MMA engine gates. Mitigation: per-file classification documented in the plan.
+- **R4 (low):** Quarantine maintenance cost — if the quarantined code rots, follow-up full removal (Option A from the design discussion) becomes necessary. The quarantine is the lower-risk path now, not a permanent commitment.
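
To keep the R3 classifier consistent, the "does it import the engine?" question could be answered mechanically. The sketch below is one possible approach and rests on assumptions: `imports_mma_engine` is a hypothetical helper, and it only detects imports of a module named `multi_agent_conductor` (under any package prefix), not later instantiation through indirect references.

```python
import ast

ENGINE_MODULE = "multi_agent_conductor"


def imports_mma_engine(source: str) -> bool:
    """Return True if the test source imports the MMA engine module."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            base = node.module or ""
            # Cover both `from pkg.engine import X` and `from pkg import engine`.
            names = [base] + [f"{base}.{alias.name}" for alias in node.names]
        else:
            continue
        if any(ENGINE_MODULE in name.split(".") for name in names):
            return True
    return False


print(imports_mma_engine("from src.multi_agent_conductor import ConductorEngine\n"))
print(imports_mma_engine("from src.dag_engine import TrackDAG\n"))
```

Under the spec's rule, a file for which this returns `True` would be gated behind `SLOP_MMA_TESTS=1`, and a file that only imports shared `dag_engine` types would stay in the default batch.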
+
+## Out of Scope
+
+- The discussion/session system redesign (nagent research track)
+- Full removal of MMA code (follow-up track if quarantine maintenance hurts)
+- RAG algorithm changes (`index_file`, `search`, chunking — unchanged)
+- The `conductor/` track system (`conductor_tech_lead.py` / `project_manager.py` / `dag_engine.py` shared-types layer preserved)
+- Migrating `ThinkingSegment` / `TrackState` / `TrackMetadata` out of `mma.py` (follow-up to full removal)
@@ -0,0 +1,27 @@
+# Track state for mma_quarantine_rag_test_decoupling_20260701
+# Updated by Tier 2 Tech Lead as tasks complete
+
+[meta]
+track_id = "mma_quarantine_rag_test_decoupling_20260701"
+name = "MMA Quarantine + RAG Test Decoupling"
+status = "active"
+current_phase = 0
+last_updated = "2026-07-01"
+
+[blocked_by]
+# None - independent track
+
+[blocks]
+# None
+
+[phases]
+phase_1 = { status = "pending", checkpointsha = "", name = "MMA Quarantine (config flag + app_controller/gui_2 gating + lazy import + test env-gate)" }
+phase_2 = { status = "pending", checkpointsha = "", name = "RAG Test Decoupling (3-tier classification + mock provider for unit/lifecycle + env-gate integration)" }
+
+[tasks]
+# Phase 1 tasks populated in plan.md after user approves spec
+# Phase 2 tasks populated in plan.md after user approves spec
+
+[verification]
+phase_1_mma_quarantine_complete = false
+phase_2_rag_test_decoupling_complete = false