conductor(track): spec for rag_test_failures_20260615 (3 RAG test fixes, single root cause)

2026-06-15 21:51:11 -04:00
parent bc388f11bb
commit 006df67637
1 changed files with 386 additions and 0 deletions
@@ -0,0 +1,386 @@
+# Track Specification: RAG Test Failures Fix
+
+**Track ID:** `rag_test_failures_20260615`
+**Status:** Active (spec approved 2026-06-15)
+**Priority:** A (foundational; precedes `data_structure_strengthening_20260606` and the user's planned `send_result` → `send` mass rename)
+**Owner:** Tier 2 Tech Lead
+**Type:** bugfix + test_fix
+**Estimated effort:** 0.5-1 day Tier 2 work (4-8 hours)
+**Parent tracks:** `data_oriented_error_handling_20260606` (shipped 2026-06-12), `ai_loop_regressions_20260614` (shipped 2026-06-15), `doeh_test_thinking_cleanup_20260615` (shipped 2026-06-15), `public_api_migration_and_ui_polish_20260615` (shipped 2026-06-15)
+
+---
+
+## 0. TL;DR
+
+A small, focused bug-fix track that resolves the **3 remaining pre-existing test failures** (not 4 as the parent track documented — `test_rag_integration.py` was inadvertently fixed by the public_api migration's Phase 2 follow-up, commit `26e1b652`).
+
+**All 3 failures share the same root cause:** the RAG sync worker at `src/app_controller.py:_do_rag_sync` catches an exception during the `RAGEngine` construction or subsequent config lookup, and the error message is `"'NoneType' object has no attribute 'get'"`. This is a specific Python error pattern indicating a `dict.get()` call is being made on a `None` value somewhere in the RAG setup path.
+
+**Result:** all 1285 tests pass (1282 + 3 RAG fixed). The project reaches a fully-green baseline for the first time since the `data_oriented_error_handling_20260606` track shipped on 2026-06-12. The user can then proceed with the planned `send_result` → `send` mass rename and the `data_structure_strengthening_20260606` track.
+
+---
+
+## 1. Overview
+
+### 1.1 Current State (as of 2026-06-15)
+
+After the `public_api_migration_and_ui_polish_20260615` track completed:
+- **1282 tests pass** (was 1280 pre-track; 7 newly-passing in the run, 13 fixed total per the completion report)
+- **4 tests skipped** (unchanged)
+- **3 tests fail** (was 10 pre-track; down from 4 RAG failures because `test_rag_integration.py::test_rag_integration` is now passing)
+
+The 3 remaining failures are all RAG subsystem tests in tier-3 (live_gui):
+
+| Test | Tier | File | Failure point |
+|---|---|---|---|
+| `test_rag_phase4_final_verify::test_phase4_final_verify` | tier-3 (live_gui) | `tests/test_rag_phase4_final_verify.py` | Line 65 (after `rag_enabled=True` + wait for `rag_status == 'ready'`) |
+| `test_rag_phase4_stress::test_rag_large_codebase_verification_sim` | tier-3 (live_gui) | `tests/test_rag_phase4_stress.py` | Line 48 (same pattern) |
+| `test_rag_visual_sim::test_rag_full_lifecycle_sim` | tier-3 (live_gui) | `tests/test_rag_visual_sim.py` | Line 32 (initial status check after `rag_enabled=True`) |
+
+All 3 fail with the **same error message** captured in `rag_status`: `"error: 'NoneType' object has no attribute 'get'"`. The error originates in `src/app_controller.py:_do_rag_sync` (line 1479-1482):
+
+```python
+except Exception as e:
+    self._set_rag_status(f"error: {e}")
+    sys.stderr.write(f"[DEBUG RAG] Failed to sync engine: {e}\n")
+    sys.stderr.flush()
+```
+
+### 1.2 Gaps to Fill (this Track's Scope)
+
+| Gap | Count | Spec Section |
+|---|---|---|
+| Investigate the RAG sync NoneType.get error | 1 investigation | §3.1 |
+| Fix the underlying bug in `src/app_controller.py` and/or `src/rag_engine.py` | 1-3 code changes | §3.2 |
+| Verify the 3 RAG tests pass | 3 test fixes | §3.3 |
+
+### 1.3 Already Implemented (DO NOT re-implement)
+
+Verified by code audit (2026-06-15):
+
+- **`RAGConfig` default** (`src/models.py:1039-1065`) — has `vector_store: VectorStoreConfig = field(default_factory=lambda: VectorStoreConfig(provider='mock'))`; the default is NOT `None`. Confirmed by direct instantiation: `RAGConfig().vector_store.provider == 'mock'`.
+- **`RAGEngine.__init__` with `vector_store.provider='mock'`** — succeeds; `is_empty()` returns `True`; no further sync work is triggered (mock branch at `src/rag_engine.py:123-126`).
+- **`_do_rag_sync` coalescing** — the `token + dirty flag` pattern prevents N parallel syncs; works correctly (per `test_infrastructure_hardening_20260609` track).
+- **`_init_vector_store_result` mock branch** — sets `self.client = "mock"` and `self.collection = "mock"`; `is_empty()` and `add_documents()` both check for this and return early.
+- **`test_rag_integration.py::test_rag_integration`** — already PASSES (fixed incidentally by `public_api_migration_and_ui_polish_20260615` Phase 2 follow-up commit `26e1b652`).
+
+### 1.4 Investigation Clues
+
+The error pattern `"'NoneType' object has no attribute 'get'"` is a specific Python error indicating a `dict.get()` call on a `None` value. The most likely candidates in the RAG sync path:
+
+1. **`src/app_controller.py:1469` — `engine = rag_engine.RAGEngine(self.rag_config, self.active_project_root)`** — if `self.active_project_root` is `None` or the `RAGConfig` has a `None` sub-field.
+   - **Status:** `active_project_root` is a property that returns `str(Path(self.active_project_path).parent)` or `self.ui_files_base_dir`. The test sets `files_base_dir` to a valid path.
+   - **Status:** `RAGConfig()` default has all required fields populated.
+
+2. **`src/rag_engine.py:89-101` — `RAGEngine.__init__`** — calls `_init_embedding_provider()` and `_init_vector_store_result()`. With `vector_store.provider='mock'`, the latter should return `Result(data=None)` (success).
+   - **Status:** Verified by direct instantiation: the engine constructs successfully.
+
+3. **`src/rag_engine.py:111-128` — `_init_vector_store_result`** — the `'chroma'` branch calls `_validate_collection_dim_result()` (line 122) which calls `self.collection.get(limit=1, include=["embeddings"])` (line 146) then `res.get("embeddings")` (line 149). If `self.collection` is set but the chromadb call returns a non-dict (e.g. a `Result` object), `.get()` would fail with NoneType.
+   - **Status:** This is the most likely candidate. The `is_empty()` and `add_documents()` short-circuit on the mock string, but the `_init_vector_store_result` for the `'mock'` branch returns immediately with `Result(data=None)` (line 126) — so the chromadb validation is skipped. So this isn't the bug for the 'mock' case.
+   - **Status:** For the 'chroma' case (test_rag_phase4_stress uses 'chroma'), the validation runs. If `self.embedding_provider.embed(["__rag_dim_check__"])` fails (e.g. due to gemini client not being initialized in the test subprocess), the error could be different. But the test_rag_phase4_stress uses `rag_emb_provider='local'` which depends on `sentence_transformers`.
+
+4. **`src/app_controller.py:230` — `controller.rag_engine and controller.rag_config and controller.rag_config.enabled`** — this is the entry check; if any of these is None, the sync is skipped.
+   - **Status:** `self.rag_config` is set in `__init__` (line 1830-1831) and reset in `reset_session` (line 3387). Should never be None after init.
+
+5. **A more subtle cause:** the `submit_io` lambda in `src/app_controller.py:1457` (`self.submit_io(lambda: self._do_rag_sync(token))`) submits a lambda. If the IO pool is shared with the user-agent / MMA comms callbacks, an unrelated exception in a different task could leak into the RAG status.
+   - **Status:** Low likelihood, but worth checking.
+
+The implementer MUST use TDD red-first: add a focused test that reproduces the error with minimal setup, then trace the call chain to find the actual `.get(None)` call. The audit above is a starting point, not a definitive diagnosis.
+
+---
+
+## 2. Goals
+
+### 2.1 Functional Goals
+
+| ID | Goal | Acceptance Criterion |
+|---|---|---|
+| **G1** | Investigate the RAG sync NoneType.get error | A focused regression test reproduces the error with `rag_enabled=True` + `rag_source='mock'` setup |
+| **G2** | Fix the underlying bug | The 3 RAG tests pass after the fix; no regression in the 12 RAG-related tests that already pass |
+| **G3** | Add a defensive guard or proper error message | If a config field is unexpectedly None, the error message identifies WHICH field is None (so future debug is easier) |
+| **G4** | Update `docs/guide_rag.md` to document the fix | The relevant guide has a "Known issues" or "Troubleshooting" section if appropriate |
+
+### 2.2 Non-Functional Goals
+
+| ID | Goal | Acceptance Criterion |
+|---|---|---|
+| **NF1** | Zero new regressions | `uv run pytest tests/` shows 3 fewer failures than pre-track baseline; no new failures |
+| **NF2** | Per-task atomic commits | 1-3 atomic commits with clear messages |
+| **NF3** | 1-space indentation, no comments, type hints preserved | `uv run python -c "import ast; ast.parse(open('src/app_controller.py').read())"` succeeds |
+| **NF4** | Per-commit git notes | All commits have git notes summarizing the fix |
+
+---
+
+## 3. Per-File Design
+
+### 3.1 Investigation: Reproduce the error in isolation
+
+The first task is a TDD red. The implementer should write a test that reproduces the error with minimal setup.
+
+**Recommended test file:** `tests/test_rag_sync_none_error.py` (new file)
+
+**The test pattern:**
+```python
+def test_rag_sync_does_not_fail_with_none_error(controller_with_rag_enabled):
+    # controller_with_rag_enabled: a fixture that:
+    #   - Creates an AppController
+    #   - Sets rag_enabled=True, rag_source='mock', files_base_dir=tmp_path
+    #   - Submits the sync
+    #   - Waits for the sync to complete (poll _rag_sync_dirty or rag_status)
+    status = controller.rag_status
+    assert "error" not in status, f"RAG sync failed unexpectedly: {status}"
+    # OR
+    assert status == "ready", f"Expected 'ready', got: {status}"
+```
+
+**The diagnostic step:**
+1. Run the test; capture the full error message
+2. Add a `sys.stderr.write` traceback capture in the except clause at `src/app_controller.py:1479`
+3. Find the actual line where the `.get()` is called on None
+4. **Document the root cause** in the commit message (so the fix is traceable)
+
+### 3.2 The fix
+
+The fix depends on what the investigation finds. Three likely scenarios:
+
+**Scenario A: A config field is None** (most likely)
+- **Example:** If `self.rag_config.embedding_provider` is somehow `None` when the setter for `rag_source` is called, the engine init would fail.
+- **Fix:** Add a guard in the setter: `if not self.rag_config: return` and a fallback in the engine init: `if self.config.embedding_provider is None: raise ValueError("embedding_provider must be set before rag_enabled")`.
+- **Files affected:** `src/rag_engine.py`, possibly `src/app_controller.py`
+
+**Scenario B: A dict access is failing on a ChromaDB response**
+- **Example:** `_validate_collection_dim_result` line 149: `embeddings = res.get("embeddings") if isinstance(res, dict) else None`. If chromadb returns a different object type, the `.get()` is skipped (None is returned) but the call downstream may fail.
+- **Fix:** Add more defensive guards or correct the type check.
+- **Files affected:** `src/rag_engine.py`
+
+**Scenario C: A side effect of a previous test (subprocess state pollution)**
+- **Example:** A prior test in the live_gui subprocess left the RAG config in a bad state.
+- **Fix:** Reset the RAG config in the test's `setup` or use `live_gui.reset_session()`.
+- **Files affected:** The test (no production code change)
+
+**The implementer MUST** follow the TDD protocol: write the reproducing test, run it, observe the failure, trace the root cause, fix it, run the test again, verify all 3 RAG tests pass.
+
+### 3.3 Test verification
+
+After the fix:
+- The 3 RAG tests pass in isolation
+- The 3 RAG tests pass in batched run (`scripts/run_tests_batched.py`)
+- The full test suite has 1285 pass (was 1282) + 4 skip + 0 fail (was 3)
+- No regression in `test_rag_engine.py` (9+ tests), `test_rag_engine_result.py`, `test_rag_engine_ready_status_bug.py`, `test_rag_gui_presence.py`, `test_rag_integration.py`, `test_sync_rag_engine_coalescing.py`, `test_rag_phase4_stress.py` (after the fix)
+
+### 3.4 Documentation
+
+Update `docs/guide_rag.md` (if it exists; check first) with:
+- A short note about the fix (1 paragraph)
+- A troubleshooting entry if the error is likely to recur: "If `rag_status` shows `'NoneType' object has no attribute 'get'`, check that `rag_config.embedding_provider` is set before `rag_enabled`."
+
+If `docs/guide_rag.md` does not exist, no new doc is needed (the per-source-file guide is the wrong place for this; the test file's docstring or the commit message is sufficient).
+
+---
+
+## 4. Architecture Reference
+
+### 4.1 The RAG sync pipeline
+
+The RAG sync is initiated when any of the RAG-related setters is called (`rag_enabled`, `rag_source`, `rag_emb_provider`, `rag_chunk_size`, `rag_chunk_overlap`, etc.):
+
+```
+[Set rag_* property] -> [setter calls _sync_rag_engine()] -> [token + dirty flag update]
+                                                                    |
+                                                                    v
+                                          [submit_io(_do_rag_sync(token))] -> [IO pool worker]
+                                                                                     |
+                                                                                     v
+                                                                            [_do_rag_sync body]
+                                                                                     |
+                                                                                     v
+                                           [RAGEngine(config, base_dir) construction]
+                                                                                     |
+                                                                                     v
+                                  [if engine.is_empty() and self.files -> _rebuild_rag_index()]
+                                                                                     |
+                                                                                     v
+                                                        [set _set_rag_status("ready" | "error: ...")]
+```
+
+### 4.2 The mock branch
+
+The `RAGConfig().vector_store.provider` defaults to `'mock'`. When the engine init hits this branch:
+
+```python
+elif vs_config.provider == 'mock':
+    self.client = "mock"
+    self.collection = "mock"
+    return Result(data=None)
+```
+
+The engine is "empty" (`is_empty()` returns `True` for mock). `_rebuild_rag_index` is NOT called. The status should be "ready" immediately.
+
+### 4.3 The coalescing pattern
+
+The `token + dirty flag` pattern in `_sync_rag_engine` ensures that N rapid setter calls produce ONE sync, not N parallel syncs. This is the pattern from `test_infrastructure_hardening_20260609` track. The token check at line 1463 short-circuits superseded syncs.
+
+### 4.4 The status update mechanism
+
+`self._set_rag_status(status)` appends a task to `_pending_gui_tasks`. The GUI render loop processes the queue and updates the `rag_status` field. The test polls `client.get_value('rag_status')` to wait for the update.
+
+---
+
+## 5. Test Plan
+
+### 5.1 Per-phase test verification
+
+| Phase | Test command | Expected |
+|---|---|---|
+| 1 | `uv run pytest tests/test_rag_phase4_final_verify.py tests/test_rag_phase4_stress.py tests/test_rag_visual_sim.py -v 2>&1 \| tee tests/artifacts/rag_track_phase1_red.log` | 3/3 fail with the NoneType.get error |
+| 2 | (after fix) `uv run pytest tests/test_rag_phase4_final_verify.py tests/test_rag_phase4_stress.py tests/test_rag_visual_sim.py -v 2>&1 \| tee tests/artifacts/rag_track_phase2_green.log` | 3/3 pass |
+| 3 | (full suite) `uv run pytest tests/ 2>&1 \| tee tests/artifacts/rag_track_phase3_full.log` | 1285 pass + 4 skip + 0 fail |
+| 4 | (batched) `uv run .\scripts\run_tests_batched.py 2>&1 \| tee tests/artifacts/rag_track_phase4_batched.log` | All tiers PASS; no failures |
+
+### 5.2 TDD red verification
+
+For each new test or fix:
+1. Verify the test FAILS as expected (red phase)
+2. Implement the fix
+3. Verify the test PASSES (green phase)
+4. Verify no regression in the previously-passing tests
+5. Commit
+
+**Anti-pattern guard:** per `AGENTS.md` "Critical Anti-Patterns", no skipping tests just because they fail. The 3 RAG tests are the actual problem to solve; the implementer must find and fix the root cause.
+
+### 5.3 The diagnostic strategy
+
+If the implementer can't find the bug from the error message alone:
+1. Add `import traceback; sys.stderr.write(traceback.format_exc())` to the except clause in `src/app_controller.py:1479-1482`
+2. Run the test; capture the full traceback
+3. Find the actual `.get(None)` call
+4. **Document the traceback in the commit message** (so the fix is traceable)
+5. Remove the diag traceback after the fix is verified
+
+---
+
+## 6. Migration Strategy
+
+This is a small bug-fix track. The phases are simple:
+
+1. **Phase 1: Investigation + reproducing test (1-2 hours)**
+2. **Phase 2: Fix (1-3 hours)**
+3. **Phase 3: Full test suite + batched verification (30 min)**
+4. **Phase 4: Docs update (15 min)**
+5. **Phase 5: Metadata + tracks.md (15 min)**
+
+The order doesn't matter much (it's all one fix); the implementer can iterate between Phase 1 and 2 as needed.
+
+---
+
+## 7. Out of Scope
+
+### 7.1 Deferred to separate tracks
+
+| ID | Item | Defer to | Why |
+|---|---|---|---|
+| OOS1 | The `send_result` → `send` mass rename (user's stated intent) | User's manual refactor after this track | The user wants to do this themselves. The Result API is stable; only the function name changes. |
+| OOS2 | 23 lower-impact files with weak types (per `data_structure_strengthening_20260606/spec.md` §1 line 20) | `data_structure_strengthening_20260606` (the next major track) | That's the data_structure track's scope. |
+| OOS3 | `live_gui_mock_injection_20260615` infrastructure | Separate infrastructure track | Not blocking. Recommended but not required. |
+| OOS4 | The full RAG test cleanup (e.g., removing `time.sleep(0.5)` patterns in favor of poll loops) | Separate RAG test quality track | The tests are functional; this is a test-quality improvement, not a bug fix. |
+| OOS5 | The Gemini CLI thinking-format path | Defer to `doeh_test_thinking_cleanup_20260615` follow-up | Not in this track's scope. |
+| OOS6 | The `RAGConfig` data structure improvements (e.g., nested validation) | `data_structure_strengthening_20260606` | Not blocking the bug fix. |
+
+### 7.2 Explicitly NOT in this track
+
+- The user wants to do a `send_result` → `send` mass rename after this track. **Do not** do it in this track. The bug fix is for RAG only.
+- A general RAG test quality cleanup (poll loops, error message improvements, etc.) — out of scope; only fix the specific bug.
+- The `_rebuild_rag_index` method's complex error handling — out of scope; only fix the specific bug.
+
+---
+
+## 8. Risks & Mitigations
+
+| ID | Risk | Likelihood | Impact | Mitigation |
+|---|---|---|---|---|
+| **R1** | The fix breaks an unrelated test | Low | Medium | Run the full test suite in Phase 3 + the batched test in Phase 4. If a new failure appears, STOP and report. |
+| **R2** | The bug is in a hard-to-reach code path (deep in IO pool worker) | Medium | Medium | Add diagnostic traceback in the except clause; capture the actual error site; document in the commit message. |
+| **R3** | The fix is in the test (subprocess state pollution) not the production code | Low | Low | If the fix is in the test, document this in the commit message. Consider adding a teardown reset in the test. |
+| **R4** | The fix introduces a regression in `test_rag_engine_ready_status_bug.py` | Low | Medium | Run the full RAG test suite after the fix. |
+| **R5** | The implementation takes longer than estimated (1 day) | Low | Low | This is a 0.5-1 day track; even 2 days is acceptable. The user's overall plan is to do 2 more tracks (this + a `send_result` → `send` rename) before the data structure track. |
+
+---
+
+## 9. Verification Criteria (definition of "done")
+
+The track is DONE when **ALL** of the following are true:
+
+1. **G1: A reproducing test exists** that fails before the fix
+2. **G2: All 3 RAG tests pass** (test_rag_phase4_final_verify, test_rag_phase4_stress, test_rag_visual_sim)
+3. **G3: A defensive guard or proper error message** is added (so future debug is easier)
+4. **G4: docs/guide_rag.md** updated (if it exists)
+5. **NF1: No new regressions** in the full test suite (1285 pass + 4 skip + 0 fail)
+6. **NF2: Per-task atomic commits** (1-3 commits total)
+7. **NF3: 1-space indentation + no comments + type hints preserved**
+8. **NF4: Per-commit git notes** attached
+
+**Test count math:**
+- Pre-track baseline: 1282 pass + 4 skip + 3 fail
+- After this track: 1285 pass + 4 skip + 0 fail (3 newly-passing)
+- This is the FIRST time the project is fully green since `data_oriented_error_handling_20260606` shipped on 2026-06-12.
+
+---
+
+## 10. Execution Order & Dependencies
+
+**No external blockers.** This track can start immediately after the Tier 1 review approves the spec.
+
+**Execution order (the plan):**
+1. Phase 1: Investigation + reproducing test (1-2 hours)
+2. Phase 2: Fix (1-3 hours)
+3. Phase 3: Full test suite + batched verification (30 min)
+4. Phase 4: Docs update (15 min)
+5. Phase 5: Metadata + tracks.md (15 min)
+
+**Total:** 0.5-1 day Tier 2 work (4-8 hours)
+
+**Followed by:** the user can do the `send_result` → `send` mass rename themselves, then start `data_structure_strengthening_20260606` track.
+
+---
+
+## 11. References
+
+### Architecture docs
+- `docs/guide_rag.md` (if it exists) — RAG subsystem architecture
+- `docs/guide_app_controller.md` — the `AppController._do_rag_sync` method is the entry point
+- `docs/guide_testing.md` — `live_gui` fixture + structural testing contract
+
+### Styleguides
+- `conductor/code_styleguides/error_handling.md` — `Result[T]` pattern (used by `RAGEngine._init_vector_store_result`)
+- `conductor/code_styleguides/data_oriented_design.md` — the canonical DOD reference
+
+### Source code (the relevant lines)
+- `src/app_controller.py:1451-1488` — `_sync_rag_engine` and `_do_rag_sync` (the entry points)
+- `src/app_controller.py:1490-1497` — `rag_enabled` property + setter (triggers the sync)
+- `src/app_controller.py:3016-3023` — `_set_rag_status` (sets the error status)
+- `src/app_controller.py:3025-3056` — `_rebuild_rag_index` (the second worker)
+- `src/rag_engine.py:88-128` — `RAGEngine.__init__` and `_init_vector_store_result`
+- `src/rag_engine.py:130-166` — `_validate_collection_dim_result` (the most likely `.get()` call site)
+- `src/models.py:1039-1065` — `RAGConfig` and `VectorStoreConfig`
+
+### Parent tracks
+- `conductor/tracks/data_oriented_error_handling_20260606/spec.md` §12.1 — the follow-up scope that included RAG fixes
+- `conductor/tracks/public_api_migration_and_ui_polish_20260615/spec.md` — the parent track that documented 4 RAG failures remaining (1 was inadvertently fixed)
+- `docs/reports/TRACK_COMPLETION_public_api_migration_and_ui_polish_20260615.md` §3 deviation #2.3 — the `test_rag_integration.py` fix (commit 26e1b652)
+
+### Test files (the 3 to fix)
+- `tests/test_rag_phase4_final_verify.py::test_phase4_final_verify` (tier-3 live_gui)
+- `tests/test_rag_phase4_stress.py::test_rag_large_codebase_verification_sim` (tier-3 live_gui)
+- `tests/test_rag_visual_sim.py::test_rag_full_lifecycle_sim` (tier-3 live_gui)
+
+### Already-passing RAG tests (do NOT regress)
+- `tests/test_rag_engine.py` (8+ tests)
+- `tests/test_rag_engine_result.py` (3+ tests)
+- `tests/test_rag_engine_ready_status_bug.py` (3+ tests)
+- `tests/test_rag_gui_presence.py` (2 tests)
+- `tests/test_rag_integration.py::test_rag_integration` (1 test; was failing pre-public_api, fixed by commit 26e1b652)
+- `tests/test_sync_rag_engine_coalescing.py` (4+ tests)
+
+### User's stated intent (after this track)
+- `send_result` → `send` mass rename (user will do manually)
+- Then `data_structure_strengthening_20260606` track