docs(testing): critical live_gui_workspace path fix + 8 new sections

CRITICAL fix: - live_gui_workspace path: tmp_path_factory (banned) -> tests/artifacts/live_gui_workspace_<timestamp> (per-run timestamp) (per conductor/code_styleguides/workspace_paths.md) 8 new sections under 'Per-test Subprocess Resilience': 1. _reset_clean_baseline autouse fixture (mma_tier_usage + rag_config=default RAGConfig(), not None) 2. Watchdog and Hang Bounding (signal-based, 900s smart + 900s unconditional, replaces removed 30s daemon-thread) 3. Chroma Cache Path (tests/artifacts/.slop_cache/, parent-trailing-slash bug, pre-cleanup pattern in test_rag_phase4_final_verify) 4. xdist Worker Coordination (O_EXCL file lock, PYTEST_XDIST_WORKER, owner/client roles, stale lock demotion) 5. Required Test Dependencies Gate (sentence-transformers, uv sync --extra local-rag fix) 6. MMA and RAG State in reset_session() (5 buckets: mma_tier_usage pre-populated, rag_config fresh RAGConfig() not None) 7. _LiveGuiHandle __getitem__ (handle[0] / handle[1]) Expand 'Audit Script' -> 'Audit Scripts' (4 scripts total): - check_test_toml_paths.py (existing) - audit_main_thread_imports.py (startup_speedup) - audit_weak_types.py (data_structure_strengthening) - audit_no_models_config_io.py (config_state_owner styleguide)
2026-06-10 20:05:16 -04:00
parent 2e12b266e4
commit 5fa8a10ebf
1 changed files with 103 additions and 15 deletions
@@ -133,7 +133,7 @@ Output: `tests/logs/<timestamp>/<script_name>.txt`

 **Lifecycle**:
 1. **Setup (once per session)**:
-   - Create `tests/artifacts/live_gui_workspace/` temp directory
+   - Compute the per-run workspace path: `tests/artifacts/live_gui_workspace_<timestamp>/` (where `<timestamp>` is `datetime.now().strftime("%Y%m%d_%H%M%S")` at conftest import time). Per the `conductor/code_styleguides/workspace_paths.md` hard rule, test workspaces live in the project tree (not `%TEMP%`).
   - Write `manual_slop.toml` and `config.toml` to the workspace
   - Set up `SLOP_*` env vars to point at the workspace
   - Symlink `assets/` for fonts
@@ -148,7 +148,7 @@ Output: `tests/logs/<timestamp>/<script_name>.txt`
   - Close the log file
   - Remove the temp workspace (with 5 retries for Windows file locks)

-**Yield value**: `(process: subprocess.Popen, gui_script: str)` — but most tests just take the fixture and use the `ApiHookClient` directly.
+**Yield value**: A `_LiveGuiHandle` object (see below). Most tests just take the fixture and use the `ApiHookClient` directly.

 **Usage pattern**:
 ```python
@@ -180,9 +180,11 @@ The `live_gui` fixture yields a `_LiveGuiHandle` instead of a `(process, gui_scr

 **Backward compat:** The handle is iterable as `(process, gui_script)`, so existing `proc, _ = live_gui` patterns still work.

-### `live_gui_workspace` fixture (tests/conftest.py:657)
+### `live_gui_workspace` fixture (tests/conftest.py:727)

-Yields the absolute path to the live_gui subprocess's workspace (a `tmp_path_factory.mktemp("live_gui_workspace")` directory in pytest's tmp dir). Tests that need to create files in the workspace should request this fixture instead of hardcoding `Path("tests/artifacts/live_gui_workspace")`.
+Yields `live_gui.handle.workspace`, a `Path` to `tests/artifacts/live_gui_workspace_<timestamp>/` (computed at conftest import time). Tests that need to create files in the workspace should request this fixture instead of hardcoding the path.
+
+> **Important:** This is NOT `tmp_path_factory.mktemp()` despite the older documentation. The `workspace_paths.md` styleguide bans `tmp_path_factory` for test infrastructure workspaces (they'd live in `%TEMP%` and be unfindable). The path is in the project tree, under `tests/artifacts/`, and is shared across all xdist workers in a session (with a file-lock for spawn coordination).

 ```python
 def test_rag_setup(live_gui, live_gui_workspace):
@@ -208,6 +210,87 @@ def test_rag_final_verify(live_gui):

 Use this for tests that are sensitive to controller state pollution from prior tests in the same session. The `test_rag_phase4_final_verify` test is marked this way because the 4 sims in `test_extended_sims.py` mutate controller state (provider, model, etc.) that would otherwise pollute the RAG test.

+### `_reset_clean_baseline` autouse fixture
+
+The `clean_baseline` marker triggers an autouse `_reset_clean_baseline` fixture that calls `client.reset_session()` before the test starts. The `_handle_reset_session` controller method must clear **both** MMA state and RAG state to prevent cross-test pollution:
+- **MMA**: `mma_tier_usage`, `mma_status`, `active_tier`
+- **RAG**: `rag_engine` (if disabled), `rag_config` (reset to a fresh `RAGConfig()`, not `None` — `rag_config=None` makes all `rag_*` setters no-op because they check `if self.rag_config:`)
+
+See `tests/test_reset_session_clears_mma_and_rag.py` for the regression test that locked this contract in.
+
+### Watchdog and Hang Bounding
+
+The conftest's hang-bounding is **signal-based**, not naive `time.sleep` + `os._exit`:
+
+| Watchdog | Timeout | Trigger | Action |
+|---|---|---|---|
+| **Smart watchdog** (`_smart_watchdog_exit`) | 900s (15 min) | Waits for `_pytest_finished_event` | On signal: 5s grace, then `os._exit(0)`. On timeout: `os._exit(2)` (hard fail). |
+| **Unconditional watchdog** (`_unconditional_watchdog_exit`) | 900s (15 min, longer safety net) | Same event | Same action. Catches the case where pytest is so slow the smart expires first. |
+
+`_pytest_finished_event` is set in two hooks for redundancy: `pytest_terminal_summary` (primary — after the summary is printed) and `pytest_unconfigure` (fallback). The prior daemon-thread approach (30s `os._exit(0)`) was removed because it killed batches mid-test on Windows (daemon threads are not auto-killed by the interpreter) and hid pytest's exit code from `run_tests_batched.py`. See `conductor/workflow.md` for the broader rationale.
+
+### Chroma Cache Path and Cross-Test Pollution
+
+The chroma cache lives at `tests/artifacts/.slop_cache/chroma_<project>/`, **NOT** the per-run `live_gui_workspace_<timestamp>` subdir. This is because `active_project_root = Path(active_project_path).parent` and some test setups produce a trailing slash on the project path, placing the cache one level higher than expected.
+
+**Implication for tests:** Tests that interact with RAG should pre-clean the chroma cache to avoid persistent state from prior tests in the batch. `tests/test_rag_phase4_final_verify.py` does this:
+
+```python
+def test_phase4_final_verify(live_gui):
+    # Wipe any stale chroma from prior batched runs
+    cache = Path("tests/artifacts/.slop_cache/chroma_test_final_verify")
+    if cache.exists():
+        shutil.rmtree(cache, ignore_errors=True)
+    # ... rest of test
+```
+
+Without this cleanup, a prior batched run with a different embedding provider (e.g., Gemini 3072-dim vs local 384-dim) can leave a corrupt collection that fails the next test's `search()` with a dim-mismatch error. The `_validate_collection_dim()` mechanism in `RAGEngine` also auto-recovers (see [guide_rag.md](guide_rag.md#dimension-mismatch-protection)) but pre-cleaning is faster and avoids the stderr warning.
+
+### xdist Worker Coordination and Stale Lock Demotion
+
+The `live_gui` fixture uses an `O_EXCL`-atomic file lock at `tests/artifacts/live_gui_workspace_*/.live_gui_owner.lock`:
+
+1. Each xdist worker reads its ID from `PYTEST_XDIST_WORKER` env var (e.g., `gw0`, `gw1`, ...).
+2. The first worker to acquire the file lock becomes the **owner** and spawns `sloppy.py --enable-test-hooks`.
+3. Other workers become **clients** and wait for the hook server to respond on `127.0.0.1:8999`, then yield a `_LiveGuiHandle(process=None, ...)` (null process — they share the owner's subprocess via the hook API).
+4. If the owner worker's hook server is **not** up when a client acquires the lock (e.g., owner crashed or hasn't started), the client demotes itself: removes the stale lock and re-tries as the owner. This prevents one crashed owner from blocking all 16 workers.
+
+### Required Test Dependencies Gate
+
+`_check_required_test_dependencies()` runs in `pytest_configure` and raises `pytest.UsageError` at session start if a required test dep is missing:
+
+| Module | Package | Fix command |
+|---|---|---|
+| `sentence_transformers` | `sentence-transformers` | `uv sync --extra local-rag` |
+
+This is a regression gate for the 2026-06-09 incident where a fresh `uv sync` (without `--extra local-rag`) produced a confusing `rag_status = error: ... Install with manual_slop[local-rag]` failure inside the live_gui subprocess. The gate fails fast with the exact fix command instead. To add a new required test dep, append `(module_name, package_name)` to `_REQUIRED_TEST_IMPORTS` in `tests/conftest.py`.
+
+### MMA and RAG State in `reset_session()`
+
+`controller._handle_reset_session()` must clear 5 specific state buckets to prevent cross-test pollution:
+
+| Bucket | What gets reset | Why |
+|---|---|---|
+| `mma_tier_usage` | Pre-populated to full default shape (input, output, provider, model, tool_preset) — NOT empty dict | Downstream `_flush_to_project` does `d["model"]`; empty dict causes `KeyError`. |
+| `mma_status` | `None` | Status machine reset. |
+| `active_tier` | `None` | Current tier cleared. |
+| `rag_engine` | `None` if RAG disabled; unchanged if enabled (teardown is expensive) | Avoids 30s+ chroma re-init per test. |
+| `rag_config` | Fresh `RAGConfig()` default (not `None`) | All `rag_*` setters check `if self.rag_config:` and become no-ops if None. |
+
+The regression tests are in `tests/test_reset_session_clears_mma_and_rag.py` (poll-for-state pattern, not `time.sleep`).
+
+### `_LiveGuiHandle` Indexing (`__getitem__`)
+
+In addition to the `__iter__` tuple-unpacking backward compat, `_LiveGuiHandle` supports `__getitem__` indexing:
+
+| Expression | Returns |
+|---|---|
+| `handle[0]` | `process` |
+| `handle[1]` | `gui_script` |
+| `handle[n]` for `n >= 2` | `IndexError` |
+
+Both patterns are valid and equivalent. New code should prefer the named attributes (`.process`, `.gui_script`, `.workspace`) for clarity; the indexing is for backward compat with the old tuple-yielding fixture.
+
 ---

 ## Test Categories
@@ -590,21 +673,26 @@ uv run pytest --lf

 ---

-## The `Audit Script`
+## The `Audit Scripts`

-`scripts/check_test_toml_paths.py` greps `tests/` for direct `./<name>.toml` references and exits 0 only if all tests are sandboxed. It's the enforcement mechanism for the "no real TOML in tests" rule.
+The project has 4 audit scripts that enforce static conventions. They run as pre-commit/CI gates and exit non-zero on regression.

-**Run it**:
-```bash
-python scripts/check_test_toml_paths.py
-```
+| Script | Enforces | Run command |
+|---|---|---|
+| `scripts/check_test_toml_paths.py` | No real-TOML references in `tests/` (must be sandboxed) | `python scripts/check_test_toml_paths.py` |
+| `scripts/audit_main_thread_imports.py` | Main-thread-purity invariant: the main thread (entering `immapp.run()`) never imports a module heavier than `imgui_bundle` + lean `gui_2` skeleton. Heavy SDKs (`google.genai`, `anthropic`, `openai`, `fastapi`) are lazy-only. | `python scripts/audit_main_thread_imports.py` |
+| `scripts/audit_weak_types.py` | Type-alias convention: 430 weak `dict[str, Any]` / `list[dict[...]]` / `Tuple[...]` types across 6 high-traffic files were replaced with named `TypeAlias`es in `src/type_aliases.py`. The audit enforces the convention going forward. | `python scripts/audit_weak_types.py` |
+| `scripts/audit_no_models_config_io.py` | `AppController` is the single source of truth for config I/O; direct calls to `models.save_config` / `models.load_config` in `src/` are forbidden. | `python scripts/audit_no_models_config_io.py` |

-**Expected output**:
-```
-OK: No tests reference real TOML files.
-```
+**Per-script details:**

-If violations are found, migrate the offender to use `tmp_path` + `monkeypatch`.
+**`check_test_toml_paths.py`** greps `tests/` for direct `./<name>.toml` references and exits 0 only if all tests are sandboxed. It's the enforcement mechanism for the "no real TOML in tests" rule. If violations are found, migrate the offender to use `tmp_path` + `monkeypatch`.
+
+**`audit_main_thread_imports.py`** (added in `startup_speedup_20260606`) enforces the main-thread-purity invariant. It scans `src/` for `import` statements that pull in heavy SDKs at module level. Each violation is a site that will re-introduce a multi-second startup cost. The startup_speedup track reduced violations from 67 to 62 (5 fixed; the remaining 62 are large refactors tracked in future work).
+
+**`audit_weak_types.py`** (added in `data_structure_strengthening_20260606`) enforces the type-alias convention. The baseline file `scripts/audit_weak_types.baseline.json` records the post-refactor weak-type count (~60 after 86% reduction from 430 in 6 high-traffic files). New violations exit 1; the baseline lets you audit the delta.
+
+**`audit_no_models_config_io.py`** enforces the "AppController is the sole config owner" rule from `conductor/code_styleguides/config_state_owner.md`.

 ---