diff --git a/docs/guide_testing.md b/docs/guide_testing.md index 70ced839..7476f549 100644 --- a/docs/guide_testing.md +++ b/docs/guide_testing.md @@ -133,7 +133,7 @@ Output: `tests/logs//.txt` **Lifecycle**: 1. **Setup (once per session)**: - - Create `tests/artifacts/live_gui_workspace/` temp directory + - Compute the per-run workspace path: `tests/artifacts/live_gui_workspace_/` (where `` is `datetime.now().strftime("%Y%m%d_%H%M%S")` at conftest import time). Per the `conductor/code_styleguides/workspace_paths.md` hard rule, test workspaces live in the project tree (not `%TEMP%`). - Write `manual_slop.toml` and `config.toml` to the workspace - Set up `SLOP_*` env vars to point at the workspace - Symlink `assets/` for fonts @@ -148,7 +148,7 @@ Output: `tests/logs//.txt` - Close the log file - Remove the temp workspace (with 5 retries for Windows file locks) -**Yield value**: `(process: subprocess.Popen, gui_script: str)` — but most tests just take the fixture and use the `ApiHookClient` directly. +**Yield value**: A `_LiveGuiHandle` object (see below). Most tests just take the fixture and use the `ApiHookClient` directly. **Usage pattern**: ```python @@ -180,9 +180,11 @@ The `live_gui` fixture yields a `_LiveGuiHandle` instead of a `(process, gui_scr **Backward compat:** The handle is iterable as `(process, gui_script)`, so existing `proc, _ = live_gui` patterns still work. -### `live_gui_workspace` fixture (tests/conftest.py:657) +### `live_gui_workspace` fixture (tests/conftest.py:727) -Yields the absolute path to the live_gui subprocess's workspace (a `tmp_path_factory.mktemp("live_gui_workspace")` directory in pytest's tmp dir). Tests that need to create files in the workspace should request this fixture instead of hardcoding `Path("tests/artifacts/live_gui_workspace")`. +Yields `live_gui.handle.workspace`, a `Path` to `tests/artifacts/live_gui_workspace_/` (computed at conftest import time). Tests that need to create files in the workspace should request this fixture instead of hardcoding the path. + +> **Important:** This is NOT `tmp_path_factory.mktemp()` despite the older documentation. The `workspace_paths.md` styleguide bans `tmp_path_factory` for test infrastructure workspaces (they'd live in `%TEMP%` and be unfindable). The path is in the project tree, under `tests/artifacts/`, and is shared across all xdist workers in a session (with a file-lock for spawn coordination). ```python def test_rag_setup(live_gui, live_gui_workspace): @@ -208,6 +210,87 @@ def test_rag_final_verify(live_gui): Use this for tests that are sensitive to controller state pollution from prior tests in the same session. The `test_rag_phase4_final_verify` test is marked this way because the 4 sims in `test_extended_sims.py` mutate controller state (provider, model, etc.) that would otherwise pollute the RAG test. +### `_reset_clean_baseline` autouse fixture + +The `clean_baseline` marker triggers an autouse `_reset_clean_baseline` fixture that calls `client.reset_session()` before the test starts. The `_handle_reset_session` controller method must clear **both** MMA state and RAG state to prevent cross-test pollution: +- **MMA**: `mma_tier_usage`, `mma_status`, `active_tier` +- **RAG**: `rag_engine` (if disabled), `rag_config` (reset to a fresh `RAGConfig()`, not `None` — `rag_config=None` makes all `rag_*` setters no-op because they check `if self.rag_config:`) + +See `tests/test_reset_session_clears_mma_and_rag.py` for the regression test that locked this contract in. + +### Watchdog and Hang Bounding + +The conftest's hang-bounding is **signal-based**, not naive `time.sleep` + `os._exit`: + +| Watchdog | Timeout | Trigger | Action | +|---|---|---|---| +| **Smart watchdog** (`_smart_watchdog_exit`) | 900s (15 min) | Waits for `_pytest_finished_event` | On signal: 5s grace, then `os._exit(0)`. On timeout: `os._exit(2)` (hard fail). | +| **Unconditional watchdog** (`_unconditional_watchdog_exit`) | 900s (15 min, longer safety net) | Same event | Same action. Catches the case where pytest is so slow the smart expires first. | + +`_pytest_finished_event` is set in two hooks for redundancy: `pytest_terminal_summary` (primary — after the summary is printed) and `pytest_unconfigure` (fallback). The prior daemon-thread approach (30s `os._exit(0)`) was removed because it killed batches mid-test on Windows (daemon threads are not auto-killed by the interpreter) and hid pytest's exit code from `run_tests_batched.py`. See `conductor/workflow.md` for the broader rationale. + +### Chroma Cache Path and Cross-Test Pollution + +The chroma cache lives at `tests/artifacts/.slop_cache/chroma_/`, **NOT** the per-run `live_gui_workspace_` subdir. This is because `active_project_root = Path(active_project_path).parent` and some test setups produce a trailing slash on the project path, placing the cache one level higher than expected. + +**Implication for tests:** Tests that interact with RAG should pre-clean the chroma cache to avoid persistent state from prior tests in the batch. `tests/test_rag_phase4_final_verify.py` does this: + +```python +def test_phase4_final_verify(live_gui): + # Wipe any stale chroma from prior batched runs + cache = Path("tests/artifacts/.slop_cache/chroma_test_final_verify") + if cache.exists(): + shutil.rmtree(cache, ignore_errors=True) + # ... rest of test +``` + +Without this cleanup, a prior batched run with a different embedding provider (e.g., Gemini 3072-dim vs local 384-dim) can leave a corrupt collection that fails the next test's `search()` with a dim-mismatch error. The `_validate_collection_dim()` mechanism in `RAGEngine` also auto-recovers (see [guide_rag.md](guide_rag.md#dimension-mismatch-protection)) but pre-cleaning is faster and avoids the stderr warning. + +### xdist Worker Coordination and Stale Lock Demotion + +The `live_gui` fixture uses an `O_EXCL`-atomic file lock at `tests/artifacts/live_gui_workspace_*/.live_gui_owner.lock`: + +1. Each xdist worker reads its ID from `PYTEST_XDIST_WORKER` env var (e.g., `gw0`, `gw1`, ...). +2. The first worker to acquire the file lock becomes the **owner** and spawns `sloppy.py --enable-test-hooks`. +3. Other workers become **clients** and wait for the hook server to respond on `127.0.0.1:8999`, then yield a `_LiveGuiHandle(process=None, ...)` (null process — they share the owner's subprocess via the hook API). +4. If the owner worker's hook server is **not** up when a client acquires the lock (e.g., owner crashed or hasn't started), the client demotes itself: removes the stale lock and re-tries as the owner. This prevents one crashed owner from blocking all 16 workers. + +### Required Test Dependencies Gate + +`_check_required_test_dependencies()` runs in `pytest_configure` and raises `pytest.UsageError` at session start if a required test dep is missing: + +| Module | Package | Fix command | +|---|---|---| +| `sentence_transformers` | `sentence-transformers` | `uv sync --extra local-rag` | + +This is a regression gate for the 2026-06-09 incident where a fresh `uv sync` (without `--extra local-rag`) produced a confusing `rag_status = error: ... Install with manual_slop[local-rag]` failure inside the live_gui subprocess. The gate fails fast with the exact fix command instead. To add a new required test dep, append `(module_name, package_name)` to `_REQUIRED_TEST_IMPORTS` in `tests/conftest.py`. + +### MMA and RAG State in `reset_session()` + +`controller._handle_reset_session()` must clear 5 specific state buckets to prevent cross-test pollution: + +| Bucket | What gets reset | Why | +|---|---|---| +| `mma_tier_usage` | Pre-populated to full default shape (input, output, provider, model, tool_preset) — NOT empty dict | Downstream `_flush_to_project` does `d["model"]`; empty dict causes `KeyError`. | +| `mma_status` | `None` | Status machine reset. | +| `active_tier` | `None` | Current tier cleared. | +| `rag_engine` | `None` if RAG disabled; unchanged if enabled (teardown is expensive) | Avoids 30s+ chroma re-init per test. | +| `rag_config` | Fresh `RAGConfig()` default (not `None`) | All `rag_*` setters check `if self.rag_config:` and become no-ops if None. | + +The regression tests are in `tests/test_reset_session_clears_mma_and_rag.py` (poll-for-state pattern, not `time.sleep`). + +### `_LiveGuiHandle` Indexing (`__getitem__`) + +In addition to the `__iter__` tuple-unpacking backward compat, `_LiveGuiHandle` supports `__getitem__` indexing: + +| Expression | Returns | +|---|---| +| `handle[0]` | `process` | +| `handle[1]` | `gui_script` | +| `handle[n]` for `n >= 2` | `IndexError` | + +Both patterns are valid and equivalent. New code should prefer the named attributes (`.process`, `.gui_script`, `.workspace`) for clarity; the indexing is for backward compat with the old tuple-yielding fixture. + --- ## Test Categories @@ -590,21 +673,26 @@ uv run pytest --lf --- -## The `Audit Script` +## The `Audit Scripts` -`scripts/check_test_toml_paths.py` greps `tests/` for direct `./.toml` references and exits 0 only if all tests are sandboxed. It's the enforcement mechanism for the "no real TOML in tests" rule. +The project has 4 audit scripts that enforce static conventions. They run as pre-commit/CI gates and exit non-zero on regression. -**Run it**: -```bash -python scripts/check_test_toml_paths.py -``` +| Script | Enforces | Run command | +|---|---|---| +| `scripts/check_test_toml_paths.py` | No real-TOML references in `tests/` (must be sandboxed) | `python scripts/check_test_toml_paths.py` | +| `scripts/audit_main_thread_imports.py` | Main-thread-purity invariant: the main thread (entering `immapp.run()`) never imports a module heavier than `imgui_bundle` + lean `gui_2` skeleton. Heavy SDKs (`google.genai`, `anthropic`, `openai`, `fastapi`) are lazy-only. | `python scripts/audit_main_thread_imports.py` | +| `scripts/audit_weak_types.py` | Type-alias convention: 430 weak `dict[str, Any]` / `list[dict[...]]` / `Tuple[...]` types across 6 high-traffic files were replaced with named `TypeAlias`es in `src/type_aliases.py`. The audit enforces the convention going forward. | `python scripts/audit_weak_types.py` | +| `scripts/audit_no_models_config_io.py` | `AppController` is the single source of truth for config I/O; direct calls to `models.save_config` / `models.load_config` in `src/` are forbidden. | `python scripts/audit_no_models_config_io.py` | -**Expected output**: -``` -OK: No tests reference real TOML files. -``` +**Per-script details:** -If violations are found, migrate the offender to use `tmp_path` + `monkeypatch`. +**`check_test_toml_paths.py`** greps `tests/` for direct `./.toml` references and exits 0 only if all tests are sandboxed. It's the enforcement mechanism for the "no real TOML in tests" rule. If violations are found, migrate the offender to use `tmp_path` + `monkeypatch`. + +**`audit_main_thread_imports.py`** (added in `startup_speedup_20260606`) enforces the main-thread-purity invariant. It scans `src/` for `import` statements that pull in heavy SDKs at module level. Each violation is a site that will re-introduce a multi-second startup cost. The startup_speedup track reduced violations from 67 to 62 (5 fixed; the remaining 62 are large refactors tracked in future work). + +**`audit_weak_types.py`** (added in `data_structure_strengthening_20260606`) enforces the type-alias convention. The baseline file `scripts/audit_weak_types.baseline.json` records the post-refactor weak-type count (~60 after 86% reduction from 430 in 6 high-traffic files). New violations exit 1; the baseline lets you audit the delta. + +**`audit_no_models_config_io.py`** enforces the "AppController is the sole config owner" rule from `conductor/code_styleguides/config_state_owner.md`. ---