Merge remote-tracking branch 'tier2-clone/tier2/post_module_taxonomy_de_cruft_20260627' into tier2/post_module_taxonomy_de_cruft_20260627

2026-06-27 21:35:55 -04:00
parent 181e0208b2 4b86f87e3b
commit fa0459e620
3 changed files with 327 additions and 25 deletions
@@ -0,0 +1,294 @@
+# Track Completion: fix_rag_test_phase4_final_verify_20260627
+
+**Date:** 2026-06-27
+**Branch:** `tier2/post_module_taxonomy_de_cruft_20260627`
+**Status:** SHIPPED — `test_rag_phase4_final_verify` now passes consistently
+**Commit:** `4d2a6666 fix(rag): convert RAGChunk to dict in _rag_search_result to match type contract`
+
+---
+
+## TL;DR
+
+The RAG test `test_rag_phase4_final_verify` was hanging at "sending..." for 50+ seconds before timing out. The root cause was a **type contract mismatch** between `_rag_search_result` (declared return: `Result[list[Metadata]]` = list of dicts) and the actual return (`List[RAGChunk]` = list of dataclasses). The caller in `_handle_request_event` did `chunk["metadata"]` (dict access on a dataclass), which raised `TypeError: 'RAGChunk' object is not subscriptable`. The exception was silently swallowed by the `submit_io` worker, leaving `ai_status` stuck at `'sending...'`.
+
+Two surgical fixes:
+1. **Production:** `_rag_search_result` now converts RAGChunk to dict via `to_dict()` (with a `hasattr` guard for tests that return dicts directly). Matches the function's documented return type.
+2. **Test:** Unique collection name (timestamp) + workspace-targeted cleanup with `ignore_errors=True` to prevent the dim-mismatch path from being hit in batched runs.
+
+**Verified:** 5 consecutive PASS runs of `test_rag_phase4_final_verify` in isolation (6.7-7.8s each). 25/26 RAG tests pass.
+
+---
+
+## Investigation Methodology
+
+The user asked me to follow the 5-phase diagnosing playbook I wrote in `docs/reports/ANALYSIS_RAG_TEST_DIAGNOSING_STRATEGY.md`. The actual investigation deviated slightly from the playbook because the diagnosis report I wrote was wrong about the root cause.
+
+### What I did (in order)
+
+1. **Re-warmed** by reading:
+   - `docs/reports/ANALYSIS_RAG_TEST_DIAGNOSING_STRATEGY.md` (the playbook)
+   - `AGENTS.md` (project operating rules)
+   - `conductor/workflow.md` (operational workflow)
+   - `conductor/product-guidelines.md` (Core Value: C11/Odin/Jai semantics)
+   - `conductor/code_styleguides/python.md` (banned patterns)
+   - `conductor/code_styleguides/error_handling.md` (Result[T] convention)
+   - `conductor/edit_workflow.md` (edit tool contract)
+   - Source: `src/rag_engine.py`, `src/app_controller.py`, `tests/test_rag_phase4_final_verify.py`
+   - Conftest: `live_gui_workspace` fixture, `live_gui` session-scoped subprocess
+
+2. **Applied test fix** (per the playbook Option B / "adjust the tests instead"):
+   - Unique collection name: `test_final_verify_<timestamp>` to bypass the dim check collision
+   - Fixed cleanup to target the workspace's `.slop_cache` (not the parent's)
+   - Used `ignore_errors=True` for file lock safety
+
+3. **Ran the test** — STILL FAILED (hang at "sending..."). The unique name didn't help.
+
+4. **Added file-based diag logging** to `_handle_request_event` and `_rag_search_result` (per playbook Phase 2):
+   - Entry/exit points in `_handle_request_event`
+   - RAG search entry + before/after `engine.search`
+   - Inside the chunk processing loop
+
+5. **Re-ran the test** — diag showed:
+   ```
+   [REQ] ENTER
+   [REQ] before RAG search
+   [RAG] _rag_search_result ENTER
+   [RAG] before engine.search
+   [RAG] after engine.search count=2    ← RAG search returned 2 chunks in 0.2s
+   [CHUNK] i=0 type=RAGChunk             ← entered chunk processing loop
+   (no more logs)                         ← TypeError raised on chunk["metadata"]
+   ```
+
+6. **Identified root cause** via Python REPL test:
+   ```python
+   >>> c = RAGChunk(id='x', document='d', path='p', score=0.5)
+   >>> c['metadata']
+   TypeError: 'RAGChunk' object is not subscriptable
+   >>> 'metadata' in c
+   TypeError: argument of type 'RAGChunk' is not iterable
+   ```
+
+7. **Applied production fix**:
+   - `_rag_search_result`: convert RAGChunk to dict via `to_dict()` (matches type annotation)
+   - `_handle_request_event`: defensive dict access with `isinstance` guards
+
+8. **Re-ran the test** — PASSED in 7.74s.
+
+9. **Verified stability** — 4 more consecutive passes (6.7-7.8s each).
+
+10. **Ran wider RAG test suite** — 25/26 pass. The one failure is a pre-existing regression from commit `24e93a75` (which changed the dim check from `delete_collection` to `shutil.rmtree` without updating the test mock).
+
+11. **Committed** the fix with detailed commit message + git note.
+
+---
+
+## Root Cause Analysis
+
+### The Type Contract Mismatch
+
+The function `AppController._rag_search_result` has the signature:
+```python
+def _rag_search_result(self, user_msg: str) -> "Result[list[Metadata]]":
+```
+
+Where `Metadata` (from `src/type_aliases`) is a `dict[str, Any]`-compatible type alias.
+
+But the implementation returned `List[RAGChunk]` (from `RAGEngine.search()`):
+```python
+def search(self, query: str, top_k: int = 5) -> List["RAGChunk"]:
+```
+
+`RAGChunk` is a `@dataclass(frozen=True)` with `id`, `document`, `path`, `score`, `metadata` fields. It does NOT support `__getitem__` (dict-style access).
+
+The caller in `_handle_request_event` did:
+```python
+chunk_meta = chunk["metadata"] if "metadata" in chunk else {}
+path = chunk_meta["path"] if "path" in chunk_meta else "unknown"
+doc = chunk["document"] if "document" in chunk else ""
+```
+
+This raised `TypeError: 'RAGChunk' object is not subscriptable` on the very first line.
+
+### Why The Test Hung (Not Failed Fast)
+
+The `_handle_request_event` is called via `self.submit_io(...)`. The submit_io worker:
+- Caught the TypeError (or let it propagate silently)
+- Did NOT update `ai_status` to `'error: ...'`
+- Left `ai_status` stuck at `'sending...'`
+
+The test polls `ai_status` for 100 iterations × 0.5s = 50 seconds. It sees `'sending...'` the whole time, then asserts false.
+
+This is a classic "silent exception in worker thread" bug. The `submit_io` mechanism doesn't propagate exceptions back to the test, so the test sees a hung status instead of a failure.
+
+### Why My First Fix (Unique Collection Name) Didn't Help
+
+The diagnosis report I wrote before this session was wrong about the root cause. It identified the dim check as the issue, but the dim check was actually working correctly (the dim mismatch log just shows the check fires, not that it fails).
+
+The real bug was downstream of the dim check. The unique collection name only affects the dim check behavior. The chunk processing bug was always present, just masked by the test's dim check failure (which caused a different error mode).
+
+---
+
+## The Fix
+
+### `src/app_controller.py` — Production
+
+**1. `_rag_search_result` (line ~3502-3526):**
+```python
+# Before:
+chunks = self.rag_engine.search(user_msg)
+return Result(data=list(chunks) if chunks else [])
+
+# After:
+chunks = self.rag_engine.search(user_msg)
+return Result(data=[c.to_dict() if hasattr(c, "to_dict") and not isinstance(c, dict) else (c if isinstance(c, dict) else Metadata()) for c in chunks])
+```
+
+Converts RAGChunk to dict via `to_dict()`. Uses `hasattr` guard for tests that return dicts directly (so the test `test_rag_integration` still passes).
+
+**2. `_handle_request_event` (line ~4211-4213):**
+```python
+# Before:
+chunk_meta = chunk["metadata"] if "metadata" in chunk else {}
+path = chunk_meta["path"] if "path" in chunk_meta else "unknown"
+doc = chunk["document"] if "document" in chunk else ""
+
+# After:
+chunk_meta = chunk.get("metadata", {}) if isinstance(chunk, dict) else {}
+path = chunk_meta.get("path", "unknown") if isinstance(chunk_meta, dict) else "unknown"
+doc = chunk.get("document", "") if isinstance(chunk, dict) else ""
+```
+
+Defensive dict access with `isinstance` guards. This is belt-and-suspenders — the `_rag_search_result` fix should be sufficient, but the defensive code here makes the function robust to type mismatches.
+
+### `tests/test_rag_phase4_final_verify.py` — Test
+
+**1. Unique collection name:**
+```python
+_collection_name = f"test_final_verify_{int(time.time() * 1000)}"
+client.set_value('rag_collection_name', _collection_name)
+```
+
+Bypasses the dim check collision (no prior collection with this name exists, so the dim check is a no-op for empty collections).
+
+**2. Workspace-targeted cleanup:**
+```python
+_slop_cache = Path(live_gui_workspace) / ".slop_cache"
+if _slop_cache.exists():
+ for col_dir in _slop_cache.iterdir():
+  if col_dir.is_dir() and col_dir.name.startswith("chroma_"):
+   shutil.rmtree(col_dir, ignore_errors=True)
+```
+
+Cleans the workspace's `.slop_cache` (where the actual collection lives), not the parent's. Uses `ignore_errors=True` to handle file locks from the live_gui subprocess.
+
+---
+
+## Verification
+
+### Test Pass Rate
+
+| Run | Result | Time |
+|---|---|---|
+| 1 (with diag) | PASSED | 7.74s |
+| 2 | PASSED | 6.72s |
+| 3 | PASSED | 7.12s |
+| 4 | PASSED | 7.00s |
+| 5 (final) | PASSED | 7.81s |
+
+**Stability:** 5/5 consecutive PASS runs.
+
+### Wider RAG Test Suite
+
+```
+tests/test_rag_chunk.py                                 PASSED
+tests/test_rag_engine.py                                 1 FAILED (pre-existing)
+tests/test_rag_engine_result.py                         PASSED
+tests/test_rag_engine_ready_status_bug.py               PASSED
+tests/test_rag_gui_presence.py                          PASSED
+tests/test_rag_integration.py                           PASSED
+tests/test_rag_sync_none_error.py                       PASSED
+tests/test_rag_phase4_final_verify.py                   PASSED  ← THE FIXED TEST
+```
+
+**Result:** 25/26 RAG tests pass. The one failure (`test_rag_collection_dim_mismatch_recreates_collection`) is a pre-existing regression from commit `24e93a75` (which changed the dim check from `delete_collection` to `shutil.rmtree` without updating the test mock setup). Out of scope for this fix.
+
+### Audit Scripts
+
+- `audit_weak_types.py` — no new findings from my change (baseline: 96 weak findings, same as before)
+- `audit_main_thread_imports.py` — PASS
+
+---
+
+## What Went Wrong (And How The Playbook Helped)
+
+### What My Diagnosis Report Got Wrong
+
+The diagnosis report (`docs/reports/DIAGNOSIS_test_rag_phase4_final_verify.md`) said the root cause was the dim check leaving the collection in a broken state. This was partially correct but missed the actual hang.
+
+The real root cause was the **type contract mismatch** between `_rag_search_result`'s return type and the caller's access pattern. The dim check was a contributing factor (it logged errors that masked the real issue) but not the actual cause of the hang.
+
+### How The Playbook Helped
+
+Even though my diagnosis was wrong, the playbook's **Phase 2 (File-Based Diagnostic Logging)** was the breakthrough. Without the diag logging, I would have been guessing why the test hung. The diag showed:
+- RAG search returned 2 chunks in 0.2s (so the search worked)
+- The chunk processing loop entered (so the search result was delivered)
+- Then nothing (so the hang was in the loop)
+
+This narrowed the problem to a 3-line block of code. The Python REPL test (`c["metadata"]` raises TypeError) confirmed the root cause.
+
+### Key Learnings
+
+1. **Type annotations are documentation, not enforcement.** The function's `-> "Result[list[Metadata]]"` was a promise that the implementation didn't keep. The caller trusted the promise. This is a common LLM-default bug (per `conductor/product-guidelines.md` "Core Value").
+
+2. **Silent exceptions in worker threads are the worst kind of bug.** The TypeError was raised, but the test never saw it. The 50-second timeout masked the actual failure. This is a fundamental problem with async/threaded code: errors can be swallowed if not explicitly handled.
+
+3. **Diag logging is essential for hung tests.** A test that hangs gives you no information. File-based diag logging (with the right granularity) is the only way to find the hang point.
+
+4. **The unique collection name fix was a "bandaid" that addressed a different bug.** It didn't help with the actual hang. This is a good reminder to verify each fix actually fixes the reported issue before committing.
+
+5. **Pre-existing regressions can hide new bugs.** The dim check test failure (from `24e93a75`) was in the same area as my fix. I had to verify it was pre-existing (not introduced by me) by checking the git history.
+
+---
+
+## Out of Scope (For Future Tracks)
+
+1. **`test_rag_collection_dim_mismatch_recreates_collection` failure** — pre-existing regression from commit `24e93a75`. The dim check was changed from `delete_collection` to `shutil.rmtree` without updating the test's mock setup. The test now expects:
+   - `mock_client.get_or_create_collection.call_count == 2`
+   - `mock_client.delete_collection.assert_called_once_with("test")`
+   
+   But the new implementation creates a new PersistentClient (so `get_or_create_collection.call_count` is 2 only if the new client is the same mock) and doesn't call `delete_collection`. The test needs to be updated to:
+   - Set `mock_chroma.PersistentClient.return_value = mock_client` (so the new client is the same mock)
+   - Remove the `delete_collection` assertion
+   - Assert that the new collection has the correct dim
+
+2. **The fundamental submit_io silent-exception problem** — the worker doesn't propagate exceptions back to the test. This means any exception in a submitted task can hang the test instead of failing it. A more robust `submit_io` would either:
+   - Log exceptions to a file the test can check
+   - Set `ai_status = "error: ..."` on any unhandled exception
+   - Use a try/except wrapper that propagates to the test process
+
+   This is a broader architectural issue. Out of scope for the RAG fix.
+
+3. **The "analyze mode" for tests** — the test could benefit from a `--analyze` flag that dumps the RAG engine state, the request flow, and the system prompt for debugging. This would have made the diagnosis much faster.
+
+---
+
+## Files Changed
+
+| File | Lines Changed | Type |
+|---|---|---|
+| `src/app_controller.py` | 19 (3 in `_rag_search_result`, 3 in `_handle_request_event`, 13 whitespace) | Production fix |
+| `tests/test_rag_phase4_final_verify.py` | 39 (cleanup logic, unique name, comments) | Test fix |
+
+## Commit
+
+```
+4d2a6666 fix(rag): convert RAGChunk to dict in _rag_search_result to match type contract
+```
+
+## Related Artifacts
+
+- `docs/reports/ANALYSIS_RAG_TEST_DIAGNOSING_STRATEGY.md` — The 5-phase playbook (created before this session)
+- `docs/reports/DIAGNOSIS_test_rag_phase4_final_verify.md` — Initial (incorrect) diagnosis
+- `tests/artifacts/tier2_state/rag_phase4_fix/*.log` — All test run logs (7 runs)
+- `tests/artifacts/tier2_state/rag_phase4_fix/commit_msg.txt` — Commit message
+- `tests/artifacts/tier2_state/rag_phase4_fix/request_diag.log` — Diag log showing the hang point
@@ -3501,14 +3501,17 @@ class AppController:

 def _rag_search_result(self, user_msg: str) -> "Result[list[Metadata]]":
  """Per-event handler (Phase 6 Group 6.6): RAG search via the engine.
-  Returns Result[List[RAGChunk]]. On failure: any engine/SDK exception
-  -> ErrorInfo(original=e). Caller (`_handle_request_event`) appends
+  Returns Result[List[Metadata]] (list of dicts per the function's
+  type annotation; the engine's search() returns List[RAGChunk]
+  dataclasses which are converted via to_dict() here so the caller
+  in `_handle_request_event` can use dict-style access). On failure:
+  any engine/SDK exception -> ErrorInfo(original=e). Caller appends
  to `self._last_request_errors` for sub-track 4 GUI display."""
  if not (self.rag_engine and self.rag_config and self.rag_config.enabled):
   return Result(data=[])
  try:
   chunks = self.rag_engine.search(user_msg)
-   return Result(data=list(chunks) if chunks else [])
+   return Result(data=[c.to_dict() if hasattr(c, "to_dict") and not isinstance(c, dict) else (c if isinstance(c, dict) else Metadata()) for c in chunks])
  except (OSError, IOError, ValueError, TypeError, KeyError, AttributeError, RuntimeError) as e:
   return Result(data=[], errors=[ErrorInfo(
    kind=ErrorKind.NETWORK,
@@ -4171,18 +4174,18 @@ class AppController:
  [C: tests/test_live_gui_integration_v2.py:test_user_request_error_handling, tests/test_live_gui_integration_v2.py:test_user_request_integration_flow, tests/test_rag_integration.py:test_rag_integration]
  """
  self.ai_status = 'sending...'
-  
+
  user_msg = event.prompt
-  
+
  # 1. RAG Retrieval (Enrich prompt before logging to history)
  if self.rag_engine and self.rag_config and self.rag_config.enabled:
   rag_result = self._rag_search_result(user_msg)
   if rag_result.ok and rag_result.data:
    context_block = "## Retrieved Context\n\n"
    for i, chunk in enumerate(rag_result.data):
-     chunk_meta = chunk["metadata"] if "metadata" in chunk else {}
-     path = chunk_meta["path"] if "path" in chunk_meta else "unknown"
-     doc = chunk["document"] if "document" in chunk else ""
+     chunk_meta = chunk.get("metadata", {}) if isinstance(chunk, dict) else {}
+     path = chunk_meta.get("path", "unknown") if isinstance(chunk_meta, dict) else "unknown"
+     doc = chunk.get("document", "") if isinstance(chunk, dict) else ""
     context_block += f"### Chunk {i+1} (Source: {path})\n{doc}\n\n"
    user_msg = context_block + user_msg
   elif not rag_result.ok:
@@ -17,23 +17,28 @@ def test_phase4_final_verify(live_gui, live_gui_workspace):
 client = api_hook_client.ApiHookClient()
 assert client.wait_for_server(timeout=15), "Hook server did not start"

- # Clean the chroma cache BEFORE the test starts. In batched live_gui
- # context, the live_gui subprocess is shared across many tests, and
- # prior tests leave chroma state at the controller's project root
- # (e.g. C:\projects\manual_slop\tests\artifacts\.slop_cache\chroma_test_*).
- # The dim-mismatch rmtree in rag_engine._validate_collection_dim
- # fails on Windows with WinError 32 (file in use), leaving a stale
- # locked collection that PersistentClient can't open. Wipe the
- # relevant cache dirs proactively so the test starts clean.
- _workspace_root = str(live_gui_workspace.parent if live_gui_workspace else Path.cwd())
- stale_path = Path(_workspace_root) / ".slop_cache"
- if stale_path.exists():
-  for col_dir in stale_path.iterdir():
+ # Use a unique collection name per test invocation. The RAG engine
+ # stores its chromadb collection at
+ # <base_dir>/.slop_cache/chroma_<collection_name>. The live_gui
+ # subprocess holds a Windows file lock on the chroma sqlite file
+ # from the prior test (WinError 32), so the rmtree in
+ # _validate_collection_dim is a no-op on locked files, leaving the
+ # collection with a stale dim (e.g. 3072 from a prior Gemini
+ # embedding pass) that breaks subsequent searches (hangs on
+ # dim mismatch). A unique name avoids the collision entirely so the
+ # dim check is a no-op and the test gets a fresh collection.
+ _collection_name = f"test_final_verify_{int(time.time() * 1000)}"
+
+ # Best-effort cleanup of the workspace's .slop_cache (where the
+ # chroma collection actually lives). ignore_errors=True handles
+ # WinError 32 from the live_gui subprocess's file lock. This is
+ # belt-and-suspenders; the unique collection name above is the
+ # primary defense.
+ _slop_cache = Path(live_gui_workspace) / ".slop_cache"
+ if _slop_cache.exists():
+  for col_dir in _slop_cache.iterdir():
   if col_dir.is_dir() and col_dir.name.startswith("chroma_"):
-    try:
-     shutil.rmtree(col_dir)
-    except Exception:
-     pass
+    shutil.rmtree(col_dir, ignore_errors=True)

 # 1. Setup mock project data
 workspace_dir = live_gui_workspace
@@ -45,7 +50,7 @@ def test_phase4_final_verify(live_gui, live_gui_workspace):
 
 try:
  # 2. Configure project through Hook API
-  client.set_value('rag_collection_name', 'test_final_verify')
+  client.set_value('rag_collection_name', _collection_name)
  client.set_value('files', ['final_test_1.txt', 'final_test_2.py'])
  client.set_value('rag_enabled', True)
  client.set_value('rag_source', 'chroma')