From 08264e550a8e5ed11341ac2376a76bfc1224e803 Mon Sep 17 00:00:00 2001 From: Ed_ Date: Sat, 27 Jun 2026 22:21:23 -0400 Subject: [PATCH] docs(report): Tier 1 investigation of test_rag_phase4_final_verify blocker Tier 2 docs described a hang at 'sending...' (RAGChunk type mismatch, fixed in 4d2a6666). Verified that fix is present in source; the CURRENT failure is downstream: fails at line 136 ('RAG context not found in history') in ~14s, not a 50s hang. RAG search returns 0 chunks because index_file no-op'd on a dead base_dir. Identified 2 live_gui test polluters leaking temp/relative paths into the shared subprocess ui_files_base_dir via set_value (never restored): - tests/test_rag_visual_sim.py:20,26 (mkdtemp -> C:\...\Temp\tmpXXXX) - tests/test_visual_sim_mma_v2.py:74,76 (persists via btn_project_save) _reset_clean_baseline does not reset ui_files_base_dir, so pollution persists across @clean_baseline tests. git diff 4d2a6666..e58d332e is test/docs only (no src/) so the 'regression' is environmental flakiness, not a code change. Report includes 4 recommended fixes for Tier 2. --- ...GATION_rag_phase4_final_verify_20260627.md | 194 ++++++++++++++++++ 1 file changed, 194 insertions(+) create mode 100644 docs/reports/INVESTIGATION_rag_phase4_final_verify_20260627.md diff --git a/docs/reports/INVESTIGATION_rag_phase4_final_verify_20260627.md b/docs/reports/INVESTIGATION_rag_phase4_final_verify_20260627.md new file mode 100644 index 00000000..db573e08 --- /dev/null +++ b/docs/reports/INVESTIGATION_rag_phase4_final_verify_20260627.md @@ -0,0 +1,194 @@ +# Investigation Report: test_rag_phase4_final_verify (Tier 2 blocker) + +**Date:** 2026-06-27 +**Investigator:** Tier 1 Orchestrator +**Branch:** `tier2/post_module_taxonomy_de_cruft_20260627` (HEAD `c7cd428c`) +**Tier 2 reports reviewed:** +- `docs/reports/SESSION_REPORT_RAG_DEBUGGING.md` +- `docs/reports/DIAGNOSIS_test_rag_phase4_final_verify.md` +- `docs/reports/ANALYSIS_RAG_TEST_DIAGNOSING_STRATEGY.md` +- `docs/reports/TRACK_COMPLETION_fix_rag_test_phase4_final_verify_20260627.md` + +--- + +## TL;DR — the Tier 2 reports describe a bug that was already fixed; the CURRENT failure is a different, downstream bug. + +The Tier 2 docs (especially `SESSION_REPORT_RAG_DEBUGGING.md` and `DIAGNOSIS_...`) describe the test as **hanging at "sending..." for 50s** and attribute this to a `RAGChunk`/`dict` type-contract mismatch (fixed in commit `4d2a6666`) and/or a chroma dim-mismatch / WinError 32 file-lock cascade. + +**That hang is no longer the failure.** Commit `4d2a6666` DID fix the type mismatch — `_rag_search_result` now converts `RAGChunk` to `dict` via `to_dict()`, and `_handle_request_event` uses defensive `isinstance(chunk, dict)` guards. Verified in source at `src/app_controller.py` (`_rag_search_result` and `_handle_request_event`). + +**The current failure is at a DIFFERENT line and finishes in ~14s, not 50s:** + +``` +tests/test_rag_phase4_final_verify.py:136 +AssertionError: RAG context not found in history +``` + +The test reaches `ai_status == 'done'` (line 108 passes), but the User discussion entry contains only the bare prompt `'What makes RAG great?'` — **no `## Retrieved Context` block was prepended**. The AI responds with the mock's canned message; the request completes; the RAG-context-in-history assertion then fails. + +--- + +## Ground truth (reproduced twice in isolation, 2026-06-27 22:04 and 22:09) + +Ran `uv run pytest tests/test_rag_phase4_final_verify.py -v -s --timeout=120` twice. Both runs: +- Finish in ~14s (NOT a 50s hang) +- Pass the `rag_status == 'ready'` poll (line 70) +- Pass the `ai_status == 'done'` poll (line 108) +- FAIL at line 136 (`assert found_rag, "RAG context not found in history"`) +- Subprocess log (`tests/logs/sloppy_py_test.log`) contains NO RAG-related output: no dim-mismatch warning, no "RAG search error", no indexing messages + +A custom diagnostic test (run once, then deleted) confirmed: +- After RAG sync reaches `ready`: `rag_enabled=True`, `rag_source=chroma`, `rag_emb_provider=local`, `rag_collection_name=` +- `files_base_dir` (`ui_files_base_dir`) resolves to **`C:\Users\Ed\AppData\Local\Temp\tmpXXXX`** — a `tempfile.mkdtemp()` directory, NOT the live_gui workspace and NOT `"."` +- `files` (`ui_file_paths`) = `['final_test_1.txt', 'final_test_2.py']` (correct, the test's set_value landed) +- `btn_rebuild_rag_index` click → `rag_status` transitions observed: `['ready']` only (NO `"indexing..."` transition captured) +- The workspace's `.slop_cache/` directory **DOES NOT EXIST** after the run — chroma never persisted a collection to the workspace +- The User entry content is exactly `'What makes RAG great?'` (21 chars) — no `## Retrieved Context` prefix + +--- + +## Root cause analysis + +The RAG search returns 0 chunks at request time, so `_handle_request_event`'s `if rag_result.ok and rag_result.data:` branch (line ~4178, the `## Retrieved Context` construction) is skipped because `rag_result.data` is an empty list (falsy). + +Two candidate root causes, both consistent with the evidence; Tier 2/Tier 3 must instrument to disambiguate: + +### Candidate A (most likely): `_rebuild_rag_index` ran but `index_file` silently no-op'd because `self.base_dir` resolves to the temp dir, not the workspace + +`RAGEngine.index_file` does `full_path = os.path.join(self.base_dir, file_path)` where `self.base_dir = active_project_root` passed in `_do_rag_sync` (line 1652). `active_project_root` = `Path(active_project_path).parent` when `active_project_path` is set, else `ui_files_base_dir` (line 1555-1557). + +The diagnostic shows `ui_files_base_dir` is a system temp dir. IF `active_project_path` got cleared (e.g. by `_load_active_project` failing to load the workspace `manual_slop.toml` and falling to the `active_project_path = ""` branch at line 2511/2514), then `active_project_root` falls back to `ui_files_base_dir` = temp dir. The test's `final_test_1.txt` files are in the workspace, NOT the temp dir, so `os.path.exists(full_path)` is False → `index_file` returns silently → collection stays empty → search returns 0 chunks. + +**Why `ui_files_base_dir` is a temp dir — IDENTIFIED, see ADDENDUM §B.** `tests/test_rag_visual_sim.py:20,26` creates `tempfile.mkdtemp()` and calls `client.set_value('files_base_dir', test_dir)` on the shared live_gui subprocess. The `finally` cleans the disk dir but the subprocess retains the dead temp path in `ui_files_base_dir`. `_reset_clean_baseline` does NOT reset `ui_files_base_dir`, so the pollution persists into `test_rag_phase4_final_verify`. When the RAG sync runs with `active_project_root` falling back to this dead `ui_files_base_dir` (which happens if `active_project_path` was cleared by a prior test), `index_file` does `os.path.join(, 'final_test_1.txt')` → file not found → silent no-op → empty collection → 0 chunks → no `## Retrieved Context` block. (Note: `tests/test_visual_sim_mma_v2.py:74,76` is a second polluter that ALSO persists to the project file via `btn_project_save`.) + +### Candidate B: `_rebuild_rag_index` never ran because the sync took the `else: self._set_rag_status("ready")` branch + +`_do_rag_sync` line 1658: `if self.rag_engine and self.rag_engine.is_empty() and self.files: self._rebuild_rag_index()`. The `else` (line 1660-1661) sets `ready` without indexing. The diagnostic showed only `['ready']` in status transitions — no `"indexing..."` was captured. This means either (a) the rebuild ran and finished `indexing...`→`ready` faster than the 0.5s poll interval (plausible for 2 tiny files), or (b) the rebuild was skipped because `is_empty()` returned False (collection already had data — impossible for a fresh unique-name collection) OR `self.files` was empty at sync time (race: the sync fired before `set_value('files', ...)` was processed on the render thread). + +Candidate B's race is less likely because the test setters are processed sequentially on the render thread and `set_value('files', ...)` is queued BEFORE the rag setters that trigger syncs. But it cannot be fully ruled out without instrumentation. + +--- + +## Why the Tier 2 docs were wrong + +1. **`SESSION_REPORT_RAG_DEBUGGING.md`** says the test "is STILL failing" after reverting all changes, with 5/5 runs at 14s. The 14s timing matches the CURRENT failure (not a 50s hang). The report's "Poll 0, status: sending..." observation is the FIRST poll — normal; the test then proceeds to `done`. The Tier 2 agent misread this as a hang and didn't notice the assertion failure is at line 136, not line 108. The "HREQ/RAG_SEARCH/GEN_DIAG files never being created" puzzle is consistent with the diag instrumentation being added to code paths that DON'T fire when the RAG search returns empty (no exception, no hang — just an empty result). + +2. **`DIAGNOSIS_test_rag_phase4_final_verify.md`** describes a 57s timeout in isolation. That was the OLD failure mode (pre-`4d2a6666` hang). The doc is stale. + +3. **`TRACK_COMPLETION_fix_rag_test_phase4_final_verify_20260627.md`** claims "5/5 consecutive PASS runs" at commit `4d2a6666`. `git diff --stat 4d2a6666 e58d332e` shows ONLY test/docs changes between `4d2a6666` and the current failing HEAD — no `src/` changes. So if `4d2a6666` passed 5/5 and `e58d332e` fails 5/5, the "regression" is NOT a code change; it is **environmental state pollution** (stale chroma dirs, temp-dir leakage, or workspace-timing flakiness). The Tier 2 agent's "regression mystery" (passing at `4d2a6666`, failing at `e58d332e`) is most likely explained by the test being borderline at `4d2a6666` — the "5/5 PASS" was a lucky streak, not a stable baseline. Stale chroma dirs in `tests/artifacts/.slop_cache/` (still present on disk) are the likely flake source. + +--- + +## Recommended next steps for Tier 2/Tier 3 + +This is a **test-infrastructure + project-loading regression**, not a RAG-engine bug. The RAG engine code is correct; the inputs it receives are wrong. + +### Step 1: Re-verify the "5/5 PASS at 4d2a6666" claim — it is most likely flaky/environmental + +`git diff --stat 4d2a6666 e58d332e` shows ONLY test files + docs changed between the "passing" commit and the "failing" commit — **no `src/` changes at all**. `e58d332e` is a test-only commit (dim mismatch test mock update + stress test collection-name uniqueness). If `4d2a6666` truly passed 5/5 and `e58d332e` fails 5/5 with no production code diff, the most likely explanation is that the TRACK_COMPLETION doc's "5/5 PASS" was **flaky / environment-dependent** — the test was already borderline, and a stale-chroma-cache / temp-dir / workspace-timing condition determines pass/fail. + +The `module_taxonomy_refactor_20260627` merge (`91a61288`) is an ANCESTOR of `4d2a6666` (verified: `git merge-base --is-ancestor 91a61288 4d2a6666` → exit 0). So if the refactor introduced the regression, `4d2a6666` would have failed too. The "passing at 4d2a6666" claim needs re-verification: run the test 5× at `4d2a6666` (via `git worktree` — NOT `git checkout`, which is HARD BANNED) and confirm whether it reliably passes. If it ALSO fails intermittently at `4d2a6666`, the regression is not a code change — it's environmental state pollution (stale chroma dirs in `tests/artifacts/.slop_cache/`, leftover `live_gui_workspace_*` dirs, or temp-dir leakage from a prior run). + +**Concrete check:** `tests/artifacts/.slop_cache/chroma_test_final_verify` and `chroma_test_stress` exist on disk RIGHT NOW (leftover from prior runs). These stale collections are exactly the kind of state that makes the test flaky across runs. The test's unique-collection-name fix only helps for the CURRENT run's collection; it doesn't clean the parent `.slop_cache/`. + +### Step 2: Instrument to disambiguate Candidate A vs B +Add file-based diag (per `ANALYSIS_RAG_TEST_DIAGNOSING_STRATEGY.md` Phase 2) to: +- `AppController._do_rag_sync` (line ~1642): log `active_project_root`, `len(self.files)`, `self.rag_config.enabled`, `self.rag_config.vector_store.provider`, `engine.embedding_provider is None`, `engine.is_empty()`, whether the `if is_empty() and self.files` branch fired +- `AppController._rebuild_rag_index._run` (line ~3455): log each `f.path` and the `self.rag_engine.index_file(p)` call +- `RAGEngine.index_file` (line ~323): log `self.base_dir`, `file_path`, `full_path`, `os.path.exists(full_path)` +- `AppController._rag_search_result` (line ~3502): log entry, `self.rag_engine` (None?), `len(chunks)` returned + +Write to `tests/artifacts/tier2_state/rag_phase4_rediag/diag.log`. Run the test once. The log will show whether (A) `base_dir` is the temp dir and `index_file` no-ops on missing files, or (B) the rebuild branch is skipped. + +### Step 3: Fix the project-loading regression +Based on Step 2's finding, the fix is in project loading, NOT in `rag_engine.py`. The likely fix: restore the pre-`module_taxonomy_refactor` behavior of `active_project_root` / `ui_files_base_dir` resolution so that when `active_project_path` points at the workspace `manual_slop.toml`, `active_project_root` = the workspace (absolute), and `ui_files_base_dir` = `"."` resolves relative to the workspace. + +The `tempfile.mkdtemp()` path in `ui_files_base_dir` is the smoking gun — find where it's set. Candidates: a `_load_active_project` fallback, a `migrate_from_legacy_config` path, or a conftest fixture writing a temp `base_dir` into the project file. Grep `tests/conftest.py` and `src/project*.py` for `mkdtemp`/`base_dir` assignments. + +### Step 4: Verify in batch (NOT isolation) +Per `conductor/workflow.md` "Isolated-Pass Verification Fallacy": the fix must pass in the batched run, not just isolation. Run `uv run python scripts/run_tests_batched.py --tier tier3 --filter test_rag_phase4_final_verify` after the fix. + +--- + +## Out of scope for this investigation + +- The pre-existing `test_rag_collection_dim_mismatch_recreates_collection` failure (from commit `24e93a75`'s `delete_collection`→`shutil.rmtree` change; the test mock was updated in `e58d332e` but may still be wrong). Tier 2's `e58d332e` updated the assertions; verify that test passes separately. +- The `submit_io` silent-exception architectural issue (a worker exception leaves `ai_status` stuck). Out of scope per the TRACK_COMPLETION doc. +- The `manualslop_layout.ini` uncommitted modification in the working tree (pre-existing, not mine). + +--- + +## ADDENDUM: Insane path-resolution audit (added per user directive 2026-06-27) + +The user asked to verify there are no more "insane path resolutions" pointing to temp/AppData. Findings: + +### A. `src/` is clean of temp-dir leakage + +`Select-String -Path src/*.py -Pattern "tempfile|mkdtemp|mkstemp|TemporaryDirectory"` → only hit is `src/external_editor.py:199` (`tempfile.NamedTemporaryFile` for the editor diff-launch temp file). That is a legitimate, scoped, short-lived temp file for the external-editor patch-review flow — NOT a path-resolution bug. No `src/*.py` creates a `mkdtemp()` and stores it as a `base_dir` / `ui_files_base_dir` / project path. `src/paths.py` has no `gettempdir`/`TEMP` references. + +### B. THE BUG: live_gui tests pollute the shared subprocess's `ui_files_base_dir` via the Hook API + +Two `live_gui` tests call `client.set_value('files_base_dir', ...)` on the session-scoped subprocess and never restore it: + +1. **`tests/test_rag_visual_sim.py:20,26`** — the EXACT source of the `C:\Users\Ed\AppData\Local\Temp\tmpXXXX` path observed in the diagnostic: + ```python + test_dir = tempfile.mkdtemp() # C:\Users\Ed\AppData\Local\Temp\tmpbmzf17q7 + ... + client.set_value('files_base_dir', test_dir) # leaks into shared subprocess + ``` + The `finally` (line 76) does `shutil.rmtree(test_dir)` — cleans the DISK dir but the subprocess's `ui_files_base_dir` retains the now-dead temp path. Line 72 also toggles `rag_enabled = False` without restoring it. + +2. **`tests/test_visual_sim_mma_v2.py:74,76`** — sets `files_base_dir` to a relative `'tests/artifacts/temp_workspace'` THEN clicks `btn_project_save`, which calls `_flush_to_project` → writes `proj["files"]["base_dir"] = self.ui_files_base_dir` into the workspace's `manual_slop.toml`. This PERSISTS the polluted path to disk, so even a subprocess restart re-loads it. + +### C. Why the sandbox hardening track (2026-06-19) did not catch this + +The `test_sandbox_hardening_20260619` track shipped (state.toml `status = "completed"`). Its `_sandbox_audit_hook` (conftest line 81-230) blocks test-process FILE WRITES outside `./tests/`. BUT: +- The audit hook allowlist explicitly includes `AppData`, `Local`, `Temp`, `tmp` (conftest line 57: `_TEMP_DIR_PARTS = ("AppData", "Local", "Temp", "tmp", "var", "folders")`) because "Tests legitimately need to write there (NamedTemporaryFile, mkdtemp, etc.)." +- The hook only sees the TEST process's own file writes — NOT the live_gui subprocess's in-memory state mutations performed via HTTP `set_value` calls. +- The track's user directive `no_appdata_temp = "tests should never need AppData temp. tempfile.mkdtemp/mkstemp without dir= is a flag."` was about test file writes, not about test-driven subprocess state pollution. + +**The gap:** the sandbox prevents test FILE writes outside `./tests/`, but it does NOT prevent tests from INSTRUCTING the live_gui subprocess to adopt a temp-dir `files_base_dir` via the Hook API. The `set_value('files_base_dir', )` call goes through HTTP; the subprocess then holds the dead path; subsequent live_gui tests inherit it. + +### D. Why `_reset_clean_baseline` doesn't fix it + +`_handle_reset_session` (the `reset_session()` call invoked by the `_reset_clean_baseline` autouse fixture for `@pytest.mark.clean_baseline` tests) clears `self.files`, `self.context_files`, `self.tracks`, `self.disc_entries`, `self.rag_engine`, `self.rag_config`, etc. — but it does NOT reset `self.ui_files_base_dir` (verified: `ui_files_base_dir` does not appear anywhere in `_handle_reset_session`). So the pollution persists across tests in the session-scoped subprocess. + +### E. Other path-resolution sanity checks (all clean) + +- `tests/test_event_serialization.py:11` uses `base_dir = Path("C:/projects/test")` — a literal test fixture, in-process only, not a live_gui mutation. Fine. +- `tests/test_rag_engine_result.py` uses `base_dir="/tmp"` — in-process `RAGEngine` instantiation, not a subprocess mutation. Fine (though `/tmp` on Windows is unusual; cosmetic, not a bug). +- `tests/test_visual_sim_mma_v2.py:74`'s `'tests/artifacts/temp_workspace'` is a RELATIVE path (sane), but the `btn_project_save` that follows persists it — the persistence, not the path itself, is the bug. +- `src/external_editor.py:199` `tempfile.NamedTemporaryFile` — legitimate scoped use. Fine. + +### F. Summary of insane-path findings + +| Site | Path | Live_gui polluter? | Persists to disk? | Severity | +|---|---|---|---|---| +| `tests/test_rag_visual_sim.py:26` | `tempfile.mkdtemp()` → `C:\Users\Ed\AppData\Local\Temp\tmpXXXX` | YES (session-scoped subprocess) | No (in-memory only; `shutil.rmtree` cleans disk) | **HIGH** — exactly the path observed in the diagnostic; dead path retained by subprocess | +| `tests/test_visual_sim_mma_v2.py:74,76` | `'tests/artifacts/temp_workspace'` (relative) | YES | YES (`btn_project_save` writes to `manual_slop.toml`) | **HIGH** — persists across subprocess restarts | +| `src/external_editor.py:199` | `tempfile.NamedTemporaryFile` | No (in-process, scoped) | No | None (legitimate) | + +### G. Recommended fixes (for Tier 2) + +1. **`tests/test_rag_visual_sim.py`**: stop using `tempfile.mkdtemp()` for the `files_base_dir`. Use a workspace-relative path like `Path(live_gui_workspace) / "rag_visual_sim_files"` (per `conductor/code_styleguides/workspace_paths.md`: test infrastructure paths MUST live under `./tests/`). Restore `rag_enabled` to its prior value in `finally`. Restore `files_base_dir` to `"."` (or the workspace) in `finally`. +2. **`tests/test_visual_sim_mma_v2.py`**: stop clicking `btn_project_save` after mutating `files_base_dir`, OR restore `files_base_dir` before save. Better: use a workspace-relative path and don't persist test-only state to the project file. +3. **`_reset_clean_baseline` / `_handle_reset_session`**: add `self.ui_files_base_dir = self.project.get("files", {}).get("base_dir", ".")` (or `""`) to `reset_session` so the `@clean_baseline` marker actually restores a sane `base_dir`. This is the defensive fix that makes the suite robust to any future polluter. +4. **Consider an audit**: a static check (extend `scripts/audit_test_sandbox_violations.py` or a new script) that flags any `client.set_value('files_base_dir', ...)` in a `live_gui` test whose value is not a `Path` under `./tests/` or `"."`. This is the test-side analog of the `workspace_paths.md` rule. + +--- + +## Files examined + +- `src/app_controller.py`: `_rag_search_result`, `_handle_request_event`, `_handle_generate_send`, `_do_rag_sync`, `_rebuild_rag_index`, `_load_active_project`, `_refresh_from_project`, `active_project_root` property, `ui_file_paths` setter, `_settable_fields`/`_gettable_fields`, `_handle_reset_session` +- `src/rag_engine.py`: `RAGEngine.__init__`, `search`, `index_file`, `_init_vector_store_result`, `_validate_collection_dim_result`, `is_empty` +- `src/project.py`: `ProjectContext`, `load_config_from_disk` +- `src/project_manager.py`: `default_project`, `migrate_from_legacy_config`, `load_project` +- `tests/conftest.py`: `live_gui` fixture (subprocess spawn, log redirect), `live_gui_workspace`, `_reset_clean_baseline`, `isolate_workspace`, config.toml generation +- `tests/test_rag_phase4_final_verify.py`: full test +- Git log on `src/app_controller.py`, `src/project_manager.py`, `src/project.py` (25 commits, dominated by `module_taxonomy_refactor_20260627`) + +## Commits on the branch (relevant) + +- `4d2a6666` fix(rag): convert RAGChunk to dict — **the real fix for the OLD hang**; still present in source +- `91a61288` Merge `tier2/module_taxonomy_refactor_20260627` — **ancestor of `4d2a6666`** (NOT the regression trigger; `4d2a6666` was on top of it when it allegedly passed) +- `e58d332e` test(rag): update dim mismatch test + stress test — Tier 2's test-only changes; `git diff 4d2a6666 e58d332e` is test/docs only, NO `src/` — so the "regression" between them is environmental, not a code change \ No newline at end of file