docs(testing): critical live_gui_workspace path fix + 8 new sections

CRITICAL fix: - live_gui_workspace path: tmp_path_factory (banned) -> tests/artifacts/live_gui_workspace_<timestamp> (per-run timestamp) (per conductor/code_styleguides/workspace_paths.md) 8 new sections under 'Per-test Subprocess Resilience': 1. _reset_clean_baseline autouse fixture (mma_tier_usage + rag_config=default RAGConfig(), not None) 2. Watchdog and Hang Bounding (signal-based, 900s smart + 900s unconditional, replaces removed 30s daemon-thread) 3. Chroma Cache Path (tests/artifacts/.slop_cache/, parent-trailing-slash bug, pre-cleanup pattern in test_rag_phase4_final_verify) 4. xdist Worker Coordination (O_EXCL file lock, PYTEST_XDIST_WORKER, owner/client roles, stale lock demotion) 5. Required Test Dependencies Gate (sentence-transformers, uv sync --extra local-rag fix) 6. MMA and RAG State in reset_session() (5 buckets: mma_tier_usage pre-populated, rag_config fresh RAGConfig() not None) 7. _LiveGuiHandle __getitem__ (handle[0] / handle[1]) Expand 'Audit Script' -> 'Audit Scripts' (4 scripts total): - check_test_toml_paths.py (existing) - audit_main_thread_imports.py (startup_speedup) - audit_weak_types.py (data_structure_strengthening) - audit_no_models_config_io.py (config_state_owner styleguide)
2026-06-10 20:05:16 -04:00
parent 2e12b266e4
commit 5fa8a10ebf
1 changed files with 103 additions and 15 deletions
@@ -133,7 +133,7 @@ Output: `tests/logs/<timestamp>/<script_name>.txt`
 **Lifecycle**:
 1. **Setup (once per session)**:
-   - Create `tests/artifacts/live_gui_workspace/` temp directory
+   - Compute the per-run workspace path: `tests/artifacts/live_gui_workspace_<timestamp>/` (where `<timestamp>` is `datetime.now().strftime("%Y%m%d_%H%M%S")` at conftest import time). Per the `conductor/code_styleguides/workspace_paths.md` hard rule, test workspaces live in the project tree (not `%TEMP%`).
   - Write `manual_slop.toml` and `config.toml` to the workspace
   - Set up `SLOP_*` env vars to point at the workspace
   - Symlink `assets/` for fonts
@@ -148,7 +148,7 @@ Output: `tests/logs/<timestamp>/<script_name>.txt`
   - Close the log file
   - Remove the temp workspace (with 5 retries for Windows file locks)
-**Yield value**: `(process: subprocess.Popen, gui_script: str)` — but most tests just take the fixture and use the `ApiHookClient` directly.
+**Yield value**: A `_LiveGuiHandle` object (see below). Most tests just take the fixture and use the `ApiHookClient` directly.
 **Usage pattern**:
 ```python
@@ -180,9 +180,11 @@ The `live_gui` fixture yields a `_LiveGuiHandle` instead of a `(process, gui_scr
 **Backward compat:** The handle is iterable as `(process, gui_script)`, so existing `proc, _ = live_gui` patterns still work.
-### `live_gui_workspace` fixture (tests/conftest.py:657)
+### `live_gui_workspace` fixture (tests/conftest.py:727)
-Yields the absolute path to the live_gui subprocess's workspace (a `tmp_path_factory.mktemp("live_gui_workspace")` directory in pytest's tmp dir). Tests that need to create files in the workspace should request this fixture instead of hardcoding `Path("tests/artifacts/live_gui_workspace")`.
+Yields `live_gui.handle.workspace`, a `Path` to `tests/artifacts/live_gui_workspace_<timestamp>/` (computed at conftest import time). Tests that need to create files in the workspace should request this fixture instead of hardcoding the path.
 > **Important:** This is NOT `tmp_path_factory.mktemp()` despite the older documentation. The `workspace_paths.md` styleguide bans `tmp_path_factory` for test infrastructure workspaces (they'd live in `%TEMP%` and be unfindable). The path is in the project tree, under `tests/artifacts/`, and is shared across all xdist workers in a session (with a file-lock for spawn coordination).
 ```python
 def test_rag_setup(live_gui, live_gui_workspace):
@@ -208,6 +210,87 @@ def test_rag_final_verify(live_gui):
 Use this for tests that are sensitive to controller state pollution from prior tests in the same session. The `test_rag_phase4_final_verify` test is marked this way because the 4 sims in `test_extended_sims.py` mutate controller state (provider, model, etc.) that would otherwise pollute the RAG test.
 ### `_reset_clean_baseline` autouse fixture
 The `clean_baseline` marker triggers an autouse `_reset_clean_baseline` fixture that calls `client.reset_session()` before the test starts. The `_handle_reset_session` controller method must clear **both** MMA state and RAG state to prevent cross-test pollution:
 - **MMA**: `mma_tier_usage`, `mma_status`, `active_tier`
 - **RAG**: `rag_engine` (if disabled), `rag_config` (reset to a fresh `RAGConfig()`, not `None` — `rag_config=None` makes all `rag_*` setters no-op because they check `if self.rag_config:`)
 See `tests/test_reset_session_clears_mma_and_rag.py` for the regression test that locked this contract in.
 ### Watchdog and Hang Bounding
 The conftest's hang-bounding is **signal-based**, not naive `time.sleep` + `os._exit`:
 | Watchdog | Timeout | Trigger | Action |
 |---|---|---|---|
 | **Smart watchdog** (`_smart_watchdog_exit`) | 900s (15 min) | Waits for `_pytest_finished_event` | On signal: 5s grace, then `os._exit(0)`. On timeout: `os._exit(2)` (hard fail). |
 | **Unconditional watchdog** (`_unconditional_watchdog_exit`) | 900s (15 min, longer safety net) | Same event | Same action. Catches the case where pytest is so slow the smart expires first. |
 `_pytest_finished_event` is set in two hooks for redundancy: `pytest_terminal_summary` (primary — after the summary is printed) and `pytest_unconfigure` (fallback). The prior daemon-thread approach (30s `os._exit(0)`) was removed because it killed batches mid-test on Windows (daemon threads are not auto-killed by the interpreter) and hid pytest's exit code from `run_tests_batched.py`. See `conductor/workflow.md` for the broader rationale.
 ### Chroma Cache Path and Cross-Test Pollution
 The chroma cache lives at `tests/artifacts/.slop_cache/chroma_<project>/`, **NOT** the per-run `live_gui_workspace_<timestamp>` subdir. This is because `active_project_root = Path(active_project_path).parent` and some test setups produce a trailing slash on the project path, placing the cache one level higher than expected.
 **Implication for tests:** Tests that interact with RAG should pre-clean the chroma cache to avoid persistent state from prior tests in the batch. `tests/test_rag_phase4_final_verify.py` does this:
 ```python
 def test_phase4_final_verify(live_gui):
    # Wipe any stale chroma from prior batched runs
    cache = Path("tests/artifacts/.slop_cache/chroma_test_final_verify")
    if cache.exists():
        shutil.rmtree(cache, ignore_errors=True)
    # ... rest of test
 ```
 Without this cleanup, a prior batched run with a different embedding provider (e.g., Gemini 3072-dim vs local 384-dim) can leave a corrupt collection that fails the next test's `search()` with a dim-mismatch error. The `_validate_collection_dim()` mechanism in `RAGEngine` also auto-recovers (see [guide_rag.md](guide_rag.md#dimension-mismatch-protection)) but pre-cleaning is faster and avoids the stderr warning.
 ### xdist Worker Coordination and Stale Lock Demotion
 The `live_gui` fixture uses an `O_EXCL`-atomic file lock at `tests/artifacts/live_gui_workspace_*/.live_gui_owner.lock`:
 1. Each xdist worker reads its ID from `PYTEST_XDIST_WORKER` env var (e.g., `gw0`, `gw1`, ...).
 2. The first worker to acquire the file lock becomes the **owner** and spawns `sloppy.py --enable-test-hooks`.
 3. Other workers become **clients** and wait for the hook server to respond on `127.0.0.1:8999`, then yield a `_LiveGuiHandle(process=None, ...)` (null process — they share the owner's subprocess via the hook API).
 4. If the owner worker's hook server is **not** up when a client acquires the lock (e.g., owner crashed or hasn't started), the client demotes itself: removes the stale lock and re-tries as the owner. This prevents one crashed owner from blocking all 16 workers.
 ### Required Test Dependencies Gate
 `_check_required_test_dependencies()` runs in `pytest_configure` and raises `pytest.UsageError` at session start if a required test dep is missing:
 | Module | Package | Fix command |
 |---|---|---|
 | `sentence_transformers` | `sentence-transformers` | `uv sync --extra local-rag` |
 This is a regression gate for the 2026-06-09 incident where a fresh `uv sync` (without `--extra local-rag`) produced a confusing `rag_status = error: ... Install with manual_slop[local-rag]` failure inside the live_gui subprocess. The gate fails fast with the exact fix command instead. To add a new required test dep, append `(module_name, package_name)` to `_REQUIRED_TEST_IMPORTS` in `tests/conftest.py`.
 ### MMA and RAG State in `reset_session()`
 `controller._handle_reset_session()` must clear 5 specific state buckets to prevent cross-test pollution:
 | Bucket | What gets reset | Why |
 |---|---|---|
 | `mma_tier_usage` | Pre-populated to full default shape (input, output, provider, model, tool_preset) — NOT empty dict | Downstream `_flush_to_project` does `d["model"]`; empty dict causes `KeyError`. |
 | `mma_status` | `None` | Status machine reset. |
 | `active_tier` | `None` | Current tier cleared. |
 | `rag_engine` | `None` if RAG disabled; unchanged if enabled (teardown is expensive) | Avoids 30s+ chroma re-init per test. |
 | `rag_config` | Fresh `RAGConfig()` default (not `None`) | All `rag_*` setters check `if self.rag_config:` and become no-ops if None. |
 The regression tests are in `tests/test_reset_session_clears_mma_and_rag.py` (poll-for-state pattern, not `time.sleep`).
 ### `_LiveGuiHandle` Indexing (`__getitem__`)
 In addition to the `__iter__` tuple-unpacking backward compat, `_LiveGuiHandle` supports `__getitem__` indexing:
 | Expression | Returns |
 |---|---|
 | `handle[0]` | `process` |
 | `handle[1]` | `gui_script` |
 | `handle[n]` for `n >= 2` | `IndexError` |
 Both patterns are valid and equivalent. New code should prefer the named attributes (`.process`, `.gui_script`, `.workspace`) for clarity; the indexing is for backward compat with the old tuple-yielding fixture.
 ---
 ## Test Categories
@@ -590,21 +673,26 @@ uv run pytest --lf
 ---
-## The `Audit Script`
+## The `Audit Scripts`
-`scripts/check_test_toml_paths.py` greps `tests/` for direct `./<name>.toml` references and exits 0 only if all tests are sandboxed. It's the enforcement mechanism for the "no real TOML in tests" rule.
+The project has 4 audit scripts that enforce static conventions. They run as pre-commit/CI gates and exit non-zero on regression.
-**Run it**:
+| Script | Enforces | Run command |
-```bash
+|---|---|---|
-python scripts/check_test_toml_paths.py
+| `scripts/check_test_toml_paths.py` | No real-TOML references in `tests/` (must be sandboxed) | `python scripts/check_test_toml_paths.py` |
-```
+| `scripts/audit_main_thread_imports.py` | Main-thread-purity invariant: the main thread (entering `immapp.run()`) never imports a module heavier than `imgui_bundle` + lean `gui_2` skeleton. Heavy SDKs (`google.genai`, `anthropic`, `openai`, `fastapi`) are lazy-only. | `python scripts/audit_main_thread_imports.py` |
 | `scripts/audit_weak_types.py` | Type-alias convention: 430 weak `dict[str, Any]` / `list[dict[...]]` / `Tuple[...]` types across 6 high-traffic files were replaced with named `TypeAlias`es in `src/type_aliases.py`. The audit enforces the convention going forward. | `python scripts/audit_weak_types.py` |
 | `scripts/audit_no_models_config_io.py` | `AppController` is the single source of truth for config I/O; direct calls to `models.save_config` / `models.load_config` in `src/` are forbidden. | `python scripts/audit_no_models_config_io.py` |
-**Expected output**:
+**Per-script details:**
 ```
 OK: No tests reference real TOML files.
 ```
-If violations are found, migrate the offender to use `tmp_path` + `monkeypatch`.
+**`check_test_toml_paths.py`** greps `tests/` for direct `./<name>.toml` references and exits 0 only if all tests are sandboxed. It's the enforcement mechanism for the "no real TOML in tests" rule. If violations are found, migrate the offender to use `tmp_path` + `monkeypatch`.
 **`audit_main_thread_imports.py`** (added in `startup_speedup_20260606`) enforces the main-thread-purity invariant. It scans `src/` for `import` statements that pull in heavy SDKs at module level. Each violation is a site that will re-introduce a multi-second startup cost. The startup_speedup track reduced violations from 67 to 62 (5 fixed; the remaining 62 are large refactors tracked in future work).
 **`audit_weak_types.py`** (added in `data_structure_strengthening_20260606`) enforces the type-alias convention. The baseline file `scripts/audit_weak_types.baseline.json` records the post-refactor weak-type count (~60 after 86% reduction from 430 in 6 high-traffic files). New violations exit 1; the baseline lets you audit the delta.
 **`audit_no_models_config_io.py`** enforces the "AppController is the sole config owner" rule from `conductor/code_styleguides/config_state_owner.md`.
 ---