
# Testing Guide

[Top](../Readme.md) | [Architecture](guide_architecture.md) | [Simulations](guide_simulations.md) | [Workflow](../conductor/workflow.md)

---

## Overview

Manual Slop has **251 test files** in `tests/` covering every subsystem. The test infrastructure is designed around four principles:

1. **No real I/O during tests** — every test gets a sandboxed workspace via the `isolate_workspace` autouse fixture.
2. **No real AI calls** — tests use mock providers, reset session state, and never hit the network.
3. **GUI tests launch a real app** — the `live_gui` session fixture starts `sloppy.py --enable-test-hooks` so integration tests can drive the actual app via the Hook API.
4. **Tests are categorized by marker** — unit, integration, strict, clean_install, docker — so CI can opt in to expensive tests.

This guide is the canonical reference for how the test suite is structured and how to add new tests.

---

## Test File Layout

```
tests/
├── conftest.py                    # Session-wide fixtures (live_gui, isolate_workspace, etc.); the canonical source
├── test_*.py                      # 251 test files, named `test_<topic>_<aspect>.py`
├── *_sim.py                      # Integration tests using the live_gui fixture
├── test_clean_install.py          # Opt-in: clones the repo to tmp and verifies hooks
├── test_docker_build.py          # Opt-in: builds and runs the Docker image
├── test_arch_boundary_phase1.py   # Architectural boundary tests
├── test_enforce_no_real_toml.py   # Meta-test for the enforcer fixture
├── artifacts/                    # Git-ignored; test output
├── logs/                          # Git-ignored; live_gui log files
└── mock_concurrent_mma.py         # Mock providers for MMA tests
```

**Naming conventions:**
- `test_*.py` — pytest collection
- `*_sim.py` — integration test (uses `live_gui`)
- `*_e2e.py` — end-to-end test (real processes, opt-in via env var)
- `test_<area>_<aspect>.py` — single aspect of an area, e.g., `test_ai_client_cli.py`

---

## The `conftest.py` Fixtures

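The `tests/conftest.py` file defines 7 fixtures. They are grouped below by how tests receive them: autouse first, then opt-in function-scoped, then session-scoped. The grouping is for reading convenience only. At run time pytest instantiates higher-scoped fixtures first, so a session-scoped fixture such as `live_gui` is set up before the function-scoped ones.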

### Autouse Fixtures (Run Before Every Test)

#### `isolate_workspace` (line 70)

**Purpose**: Give every test a fresh, isolated workspace so it cannot pollute the user's real `manual_slop.toml`, `presets.toml`, etc.

**Mechanism**:
1. Creates a temp directory via `tmp_path_factory.mktemp("isolated_workspace")`
2. Writes a fresh `config.toml` to the temp dir
3. Sets `SLOP_CONFIG`, `SLOP_GLOBAL_PRESETS`, `SLOP_GLOBAL_TOOL_PRESETS`, `SLOP_GLOBAL_PERSONAS`, `SLOP_GLOBAL_WORKSPACE_PROFILES` env vars to point at the temp dir
4. The app reads these env vars on startup; the test sees an isolated world

**Verification**: `python scripts/check_test_toml_paths.py` exits 0 (no test references real TOMLs).
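
The mechanism above can be sketched without pytest as a small context manager. This is a minimal sketch, not the fixture from `tests/conftest.py`: the helper name `isolated_slop_env`, the placeholder `config.toml` contents, and every file name other than `config.toml` are assumptions made for illustration. Only the `SLOP_*` variable names come from this guide.

```python
import os
from contextlib import contextmanager
from pathlib import Path

# Env var names come from the guide; the file names they map to are hypothetical.
SLOP_ENV_FILES = {
    "SLOP_CONFIG": "config.toml",
    "SLOP_GLOBAL_PRESETS": "presets.toml",
    "SLOP_GLOBAL_TOOL_PRESETS": "tool_presets.toml",
    "SLOP_GLOBAL_PERSONAS": "personas.toml",
    "SLOP_GLOBAL_WORKSPACE_PROFILES": "workspace_profiles.toml",
}


@contextmanager
def isolated_slop_env(workspace: Path):
    """Point every SLOP_* variable into `workspace`, then restore the old values."""
    # Placeholder contents; the real fixture writes a proper default config.
    (workspace / "config.toml").write_text("# placeholder test config\n", encoding="utf-8")
    saved = {name: os.environ.get(name) for name in SLOP_ENV_FILES}
    try:
        for name, filename in SLOP_ENV_FILES.items():
            os.environ[name] = str(workspace / filename)
        yield workspace
    finally:
        # Restore the caller's environment exactly as it was.
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old
```

Inside a real pytest fixture, `monkeypatch.setenv` performs the same save-and-restore automatically, which is usually the simpler choice.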

#### `reset_paths` (line 95)

**Purpose**: Reset the `src.paths` global state before and after each test.

**Mechanism**: Calls `paths.reset_resolved()` so path resolution re-evaluates on the next access.

#### `reset_ai_client` (line 107)

**Purpose**: Prevent `ai_client` state from leaking between tests.

**Mechanism**:
1. Calls `ai_client.reset_session()`
2. Clears callback hooks (`confirm_and_run_callback`, `comms_log_callback`, `tool_log_callback`)
3. Clears all event listeners
4. Resets provider to `("gemini", "gemini-2.5-flash-lite")`
5. Resets MCP client state via `mcp_client.configure([], [])`

### Function-Scoped Fixtures (Opt-in)

#### `vlogger` (line 131)

**Purpose**: Provide a `VerificationLogger` instance for structured diagnostic logging.

**Usage**:
```python
def test_my_thing(vlogger):
    vlogger.log_state("Field", "before_value", "after_value")
    # ... test logic ...
    vlogger.finalize("Test Title", "PASS", "result message")
```

Output: `tests/logs/<timestamp>/<script_name>.txt`

#### `kill_process_tree` (function, line 138)

**Purpose**: Robustly kill a process and all its children. Used by `live_gui` for cleanup, but available to any test.

**Mechanism**:
- Windows: `taskkill /F /T /PID <pid>` (the `/T` flag is critical — kills the whole tree)
- Unix: `os.killpg(os.getpgid(pid), SIGKILL)` (kills the process group)
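
As a rough illustration of the two branches above, here is a minimal sketch. It is not the code from `tests/conftest.py`; the name `kill_process_tree_sketch` and the decision to swallow `ProcessLookupError` are assumptions.

```python
import os
import signal
import subprocess
import sys


def kill_process_tree_sketch(pid: int) -> None:
    """Force-kill `pid` and its children; ignore a process that already exited."""
    if sys.platform == "win32":
        # /T walks the child tree, /F forces termination.
        subprocess.run(
            ["taskkill", "/F", "/T", "/PID", str(pid)],
            capture_output=True,
            check=False,
        )
    else:
        try:
            # Signals every process in the group that `pid` belongs to.
            os.killpg(os.getpgid(pid), signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process is already gone
```

On Unix the child should be started in its own process group (for example `subprocess.Popen(..., start_new_session=True)`). Otherwise the child shares the test runner's group and `os.killpg` signals the runner as well.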

#### `mock_app` (line 157)

**Purpose**: Create an `App` instance with all external side effects mocked. For unit tests that need the App but not the GUI loop.

**Mocks applied**:
- `src.models.load_config` → returns a default config
- `src.gui_2.project_manager`
- `src.gui_2.session_logger`
- `src.gui_2.immapp.run` (prevents the actual render loop from starting)
- `src.app_controller.AppController._load_active_project`
- `src.app_controller.AppController._fetch_models`
- `App._load_fonts`
- `App._post_init`
- `src.app_controller.AppController._prune_old_logs`
- `src.app_controller.AppController.start_services`
- `src.app_controller.AppController._init_ai_and_hooks`
- `src.performance_monitor.PerformanceMonitor`

**Cleanup**: Shuts down the controller after the test.

#### `app_instance` (line 190)

**Purpose**: Equivalent to `mock_app` for most purposes. It applies the same mocks and exists for historical reasons: `test_gui_phase4.py` and `test_token_viz.py` have used it since before `mock_app` became the default choice.

### Session-Scoped Fixtures (One Per Test Run)

#### `live_gui` (line 227)

**Purpose**: Start `sloppy.py --enable-test-hooks` for the entire test session. Integration tests use this to drive the real GUI via the Hook API.

**Lifecycle**:
1. **Setup (once per session)**:
   - Compute the per-run workspace path: `tests/artifacts/live_gui_workspace_<timestamp>/` (where `<timestamp>` is `datetime.now().strftime("%Y%m%d_%H%M%S")` at conftest import time). Per the `conductor/code_styleguides/workspace_paths.md` hard rule, test workspaces live in the project tree (not `%TEMP%`).
   - Write `manual_slop.toml` and `config.toml` to the workspace
   - Set up `SLOP_*` env vars to point at the workspace
   - Symlink `assets/` for fonts
   - Launch `sloppy.py --enable-test-hooks` via `subprocess.Popen`
   - Poll `GET /status` for up to 15 seconds (waiting for the HookServer to start)
   - On failure: `pytest.fail()` (kills the process tree, aborts the session)
2. **Yield**: tests run
3. **Teardown (once per session)**:
   - Call `ApiHookClient.reset_session()` to clear GUI state
   - Kill the process tree (Windows: `taskkill /F /T`, Unix: `SIGKILL`)
   - Wait 0.5s for file handles to close
   - Close the log file
   - Remove the temp workspace (with 5 retries for Windows file locks)
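
The "poll `GET /status` for up to 15 seconds" step in the setup phase can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the helper name `wait_for_hook_server`, the polling interval, and treating any HTTP 200 as "ready" are illustrative choices, not the fixture's actual logic.

```python
import time
import urllib.error
import urllib.request


def wait_for_hook_server(
    base_url: str = "http://127.0.0.1:8999",
    timeout: float = 15.0,
    interval: float = 0.25,
) -> bool:
    """Return True once GET {base_url}/status answers 200, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/status", timeout=1.0) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; retry
        time.sleep(interval)
    return False
```

A fixture built on this would call `pytest.fail()` and kill the process tree when the helper returns `False`.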

**Yield value**: A `_LiveGuiHandle` object (see below). Most tests just take the fixture and use the `ApiHookClient` directly.

**Usage pattern**:
```python
import time
from src.api_hook_client import ApiHookClient

def test_my_thing(live_gui):
    client = ApiHookClient()  # connects to localhost:8999
    client.click("btn_id")
    time.sleep(0.5)
    assert client.get_value("show_thing") is True
```

---

## Per-test Subprocess Resilience (2026-06-09)

Added in `test_infrastructure_hardening_20260609` track. These three mechanisms address the "subprocess state pollution" and "controller state pollution" failure modes that caused batch regressions.

### `_LiveGuiHandle` class (tests/conftest.py:393)

The `live_gui` fixture yields a `_LiveGuiHandle` instead of a `(process, gui_script)` tuple. The handle exposes:

| Attribute/Method | Purpose |
|---|---|
| `process` | The `subprocess.Popen` for the sloppy.py subprocess |
| `gui_script` | Absolute path to sloppy.py |
| `workspace` | Absolute path to the subprocess's working directory (`tests/artifacts/live_gui_workspace_<timestamp>/`) |
| `is_alive()` | True if the subprocess is running |
| `ensure_alive()` | Detection-only stub: increments `respawn_count` if the subprocess is dead; it does not respawn it (respawn is deferred) |
| `respawn_count` | Number of times the subprocess was found dead |

**Backward compat:** The handle is iterable as `(process, gui_script)`, so existing `proc, _ = live_gui` patterns still work.

### `live_gui_workspace` fixture (tests/conftest.py:727)

Yields `live_gui.handle.workspace`, a `Path` to `tests/artifacts/live_gui_workspace_<timestamp>/` (computed at conftest import time). Tests that need to create files in the workspace should request this fixture instead of hardcoding the path.

> **Important:** This is NOT `tmp_path_factory.mktemp()` despite the older documentation. The `workspace_paths.md` styleguide bans `tmp_path_factory` for test infrastructure workspaces (they'd live in `%TEMP%` and be unfindable). The path is in the project tree, under `tests/artifacts/`, and is shared across all xdist workers in a session (with a file-lock for spawn coordination).

```python
def test_rag_setup(live_gui, live_gui_workspace):
    test_file = live_gui_workspace / "my_input.txt"
    test_file.write_text("hello")
    # ... configure RAG, index, query
```

### `_check_live_gui_health` autouse fixture (tests/conftest.py:650)

Runs before every test that uses `live_gui`. Calls `handle.ensure_alive()` to detect subprocess death between tests. If the subprocess died, the counter increments (but the subprocess is not respawned — see `ensure_alive` above).

### `clean_baseline` marker

Opt-in marker for tests that need a fresh controller state. Tests marked with `@pytest.mark.clean_baseline` get `/api/reset_session` called before they start, ensuring no pollution from prior tests.

```python
@pytest.mark.clean_baseline
def test_rag_final_verify(live_gui):
    # ai_input is guaranteed empty, controller is in a known state
    ...
```

Use this for tests that are sensitive to controller state pollution from prior tests in the same session. The `test_rag_phase4_final_verify` test is marked this way because the 4 sims in `test_extended_sims.py` mutate controller state (provider, model, etc.) that would otherwise pollute the RAG test.

### `_reset_clean_baseline` autouse fixture

The `clean_baseline` marker triggers an autouse `_reset_clean_baseline` fixture that calls `client.reset_session()` before the test starts. The `_handle_reset_session` controller method must clear **both** MMA state and RAG state to prevent cross-test pollution:
- **MMA**: `mma_tier_usage`, `mma_status`, `active_tier`
- **RAG**: `rag_engine` (if disabled), `rag_config` (reset to a fresh `RAGConfig()`, not `None` — `rag_config=None` makes all `rag_*` setters no-op because they check `if self.rag_config:`)

See `tests/test_reset_session_clears_mma_and_rag.py` for the regression test that locked this contract in.

### Watchdog and Hang Bounding

The conftest's hang-bounding is **signal-based**, not naive `time.sleep` + `os._exit`:

| Watchdog | Timeout | Trigger | Action |
|---|---|---|---|
| **Smart watchdog** (`_smart_watchdog_exit`) | 900s (15 min) | Waits for `_pytest_finished_event` | On signal: 5s grace, then `os._exit(0)`. On timeout: `os._exit(2)` (hard fail). |
| **Unconditional watchdog** (`_unconditional_watchdog_exit`) | 900s (15 min, longer safety net) | Same event | Same action. Catches the case where pytest is so slow the smart expires first. |

`_pytest_finished_event` is set in two hooks for redundancy: `pytest_terminal_summary` (primary — after the summary is printed) and `pytest_unconfigure` (fallback). The prior daemon-thread approach (30s `os._exit(0)`) was removed because it killed batches mid-test on Windows (daemon threads are not auto-killed by the interpreter) and hid pytest's exit code from `run_tests_batched.py`. See `conductor/workflow.md` for the broader rationale.
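
The wait-then-exit logic in the table can be sketched as follows. This is a minimal sketch, not the conftest implementation: the function name `watchdog_sketch` and the injectable `exit_fn` parameter are assumptions added so the logic can be exercised without terminating the interpreter. The conftest is described above as calling `os._exit` directly.

```python
import threading
import time
from typing import Callable


def watchdog_sketch(
    finished: threading.Event,
    timeout: float,
    grace: float,
    exit_fn: Callable[[int], None],
) -> None:
    """Exit 0 shortly after `finished` is set, or exit 2 if it never is."""
    if finished.wait(timeout):
        # pytest signalled completion: allow a short grace period, then exit cleanly.
        time.sleep(grace)
        exit_fn(0)
    else:
        # The event never arrived within the timeout: hard failure.
        exit_fn(2)
```

With the values from the table, the call would look like `watchdog_sketch(event, 900, 5, os._exit)`.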

### Chroma Cache Path and Cross-Test Pollution

The chroma cache lives at `tests/artifacts/.slop_cache/chroma_<project>/`, **NOT** the per-run `live_gui_workspace_<timestamp>` subdir. This is because `active_project_root = Path(active_project_path).parent` and some test setups produce a trailing slash on the project path, placing the cache one level higher than expected.

**Implication for tests:** Tests that interact with RAG should pre-clean the chroma cache to avoid persistent state from prior tests in the batch. `tests/test_rag_phase4_final_verify.py` does this:

```python
def test_phase4_final_verify(live_gui):
    # Wipe any stale chroma from prior batched runs
    cache = Path("tests/artifacts/.slop_cache/chroma_test_final_verify")
    if cache.exists():
        shutil.rmtree(cache, ignore_errors=True)
    # ... rest of test
```

Without this cleanup, a prior batched run with a different embedding provider (e.g., Gemini 3072-dim vs local 384-dim) can leave a corrupt collection that fails the next test's `search()` with a dim-mismatch error. The `_validate_collection_dim()` mechanism in `RAGEngine` also auto-recovers (see [guide_rag.md](guide_rag.md#dimension-mismatch-protection)) but pre-cleaning is faster and avoids the stderr warning.

### xdist Worker Coordination and Stale Lock Demotion

The `live_gui` fixture uses an `O_EXCL`-atomic file lock at `tests/artifacts/live_gui_workspace_*/.live_gui_owner.lock`:

1. Each xdist worker reads its ID from `PYTEST_XDIST_WORKER` env var (e.g., `gw0`, `gw1`, ...).
2. The first worker to acquire the file lock becomes the **owner** and spawns `sloppy.py --enable-test-hooks`.
3. Other workers become **clients** and wait for the hook server to respond on `127.0.0.1:8999`, then yield a `_LiveGuiHandle(process=None, ...)` (null process — they share the owner's subprocess via the hook API).
4. If the owner worker's hook server is **not** up when a client acquires the lock (e.g., owner crashed or hasn't started), the client demotes itself: removes the stale lock and re-tries as the owner. This prevents one crashed owner from blocking all 16 workers.
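
The owner election in step 2 relies on `O_EXCL` making file creation atomic: exactly one worker's `os.open` call succeeds. A minimal sketch, assuming a hypothetical helper name `try_acquire_owner_lock` and assuming the lock file simply stores the worker ID:

```python
import os
from pathlib import Path


def try_acquire_owner_lock(lock_path: Path, worker_id: str) -> bool:
    """Return True if this worker created the lock file, False if it already existed."""
    try:
        # O_CREAT | O_EXCL fails with FileExistsError when the file is already there,
        # so at most one caller can win the race.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fh:
        fh.write(worker_id)
    return True
```

A worker that gets `False` becomes a client. Recovering from a stale lock, as in step 4, then amounts to deleting the file and calling the helper again.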

### Required Test Dependencies Gate

`_check_required_test_dependencies()` runs in `pytest_configure` and raises `pytest.UsageError` at session start if a required test dep is missing:

| Module | Package | Fix command |
|---|---|---|
| `sentence_transformers` | `sentence-transformers` | `uv sync --extra local-rag` |

This is a regression gate for the 2026-06-09 incident where a fresh `uv sync` (without `--extra local-rag`) produced a confusing `rag_status = error: ... Install with manual_slop[local-rag]` failure inside the live_gui subprocess. The gate fails fast with the exact fix command instead. To add a new required test dep, append `(module_name, package_name)` to `_REQUIRED_TEST_IMPORTS` in `tests/conftest.py`.
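
A dependency gate of this kind can be sketched with `importlib.util.find_spec`, which checks whether a top-level module is importable without importing it. The function names and message wording below are assumptions; the real gate raises `pytest.UsageError` rather than returning a string.

```python
import importlib.util
from typing import Iterable, List, Tuple


def missing_test_dependencies(required: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the (module_name, package_name) pairs whose module cannot be found."""
    return [
        (module, package)
        for module, package in required
        if importlib.util.find_spec(module) is None
    ]


def format_missing(missing: List[Tuple[str, str]], fix_command: str) -> str:
    """Build a fail-fast message that names each package and the fix command."""
    names = ", ".join(package for _, package in missing)
    return f"Missing required test dependencies: {names}. Fix with: {fix_command}"
```

For example, `missing_test_dependencies([("sentence_transformers", "sentence-transformers")])` returns an empty list when the extra is installed.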

### MMA and RAG State in `reset_session()`

`controller._handle_reset_session()` must clear 5 specific state buckets to prevent cross-test pollution:

| Bucket | What gets reset | Why |
|---|---|---|
| `mma_tier_usage` | Pre-populated to full default shape (input, output, provider, model, tool_preset) — NOT empty dict | Downstream `_flush_to_project` does `d["model"]`; empty dict causes `KeyError`. |
| `mma_status` | `None` | Status machine reset. |
| `active_tier` | `None` | Current tier cleared. |
| `rag_engine` | `None` if RAG disabled; unchanged if enabled (teardown is expensive) | Avoids 30s+ chroma re-init per test. |
| `rag_config` | Fresh `RAGConfig()` default (not `None`) | All `rag_*` setters check `if self.rag_config:` and become no-ops if None. |

The regression tests are in `tests/test_reset_session_clears_mma_and_rag.py` (poll-for-state pattern, not `time.sleep`).
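
The two non-obvious rows in the table (a pre-populated `mma_tier_usage` and a non-`None` `rag_config`) can be illustrated with a toy controller. Everything here is a hypothetical stand-in: `RAGConfigSketch`, `ControllerSketch`, `set_rag_top_k`, the `top_k` field, and the flat shape of the tier-usage dict are assumptions, not the project's classes.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RAGConfigSketch:
    """Hypothetical stand-in for the project's RAGConfig."""
    enabled: bool = False
    top_k: int = 5


def default_tier_usage() -> dict:
    # Full default shape, so a downstream d["model"] lookup never raises KeyError.
    return {"input": 0, "output": 0, "provider": "", "model": "", "tool_preset": ""}


class ControllerSketch:
    def __init__(self) -> None:
        self.mma_tier_usage: dict = default_tier_usage()
        self.mma_status: Optional[str] = None
        self.active_tier: Optional[str] = None
        self.rag_engine: Any = None
        self.rag_config: Optional[RAGConfigSketch] = RAGConfigSketch()

    def set_rag_top_k(self, value: int) -> None:
        # Same guard as described above: a None config turns the setter into a no-op.
        if self.rag_config:
            self.rag_config.top_k = value

    def handle_reset_session(self) -> None:
        rag_enabled = bool(self.rag_config and self.rag_config.enabled)
        self.mma_tier_usage = default_tier_usage()
        self.mma_status = None
        self.active_tier = None
        if not rag_enabled:
            # Drop the engine only when RAG is off; a live engine is kept
            # because re-initialising it is expensive.
            self.rag_engine = None
        self.rag_config = RAGConfigSketch()
```

Resetting `rag_config` to `None` instead would leave `set_rag_top_k` silently doing nothing in the next test, which is the failure mode the table warns about.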

### `_LiveGuiHandle` Indexing (`__getitem__`)

In addition to the `__iter__` tuple-unpacking backward compat, `_LiveGuiHandle` supports `__getitem__` indexing:

| Expression | Returns |
|---|---|
| `handle[0]` | `process` |
| `handle[1]` | `gui_script` |
| `handle[n]` for `n >= 2` | raises `IndexError` |

Both patterns are valid and equivalent. New code should prefer the named attributes (`.process`, `.gui_script`, `.workspace`) for clarity; the indexing is for backward compat with the old tuple-yielding fixture.
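
To make the compatibility contract concrete, here is a minimal sketch of a handle that supports named attributes, tuple unpacking, and indexing. It is not the class from `tests/conftest.py`: the name `LiveGuiHandleSketch`, the dataclass layout, and treating a `None` process as "not alive" are assumptions.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Iterator, Optional


@dataclass
class LiveGuiHandleSketch:
    process: Optional[subprocess.Popen]
    gui_script: str
    workspace: Path
    respawn_count: int = 0

    def is_alive(self) -> bool:
        # poll() returns None while the child is still running.
        return self.process is not None and self.process.poll() is None

    def ensure_alive(self) -> None:
        # Detection only: count the death, do not respawn.
        if not self.is_alive():
            self.respawn_count += 1

    def __iter__(self) -> Iterator[Any]:
        # Supports the legacy `proc, script = live_gui` unpacking.
        yield self.process
        yield self.gui_script

    def __getitem__(self, index: int) -> Any:
        # Delegating to a 2-tuple gives IndexError for index >= 2 for free.
        return (self.process, self.gui_script)[index]
```

Delegating `__getitem__` to a tuple keeps the indexing and unpacking views consistent with each other by construction.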

---

## Test Categories

### 1. Unit Tests (no fixtures, fast)

Pure functions tested in isolation. No app, no GUI, no subprocess. Run in <100ms each.

**Examples**:
- `tests/test_command_palette.py` — fuzzy matcher, command registry
- `tests/test_fuzzy_anchor.py` — anchor slice algorithm
- `tests/test_paths.py` — path resolution
- `tests/test_token_usage.py` — token tracking
- `tests/test_cost_tracker.py` — cost estimation

**Pattern**:
```python
def test_my_unit():
    result = my_function(input)
    assert result == expected
```

### 2. Integration Tests (use `live_gui`, slow)

Drive the actual app via the Hook API. Run in 1-10 seconds each (real subprocess).

**Examples**:
- `tests/test_saved_presets_sim.py` — preset switching via the GUI
- `tests/test_command_palette_sim.py` — palette toggle, navigation
- `tests/test_mma_concurrent_tracks_sim.py` — multi-track MMA
- `tests/test_workspace_profiles_sim.py` — workspace profile save/load
- `tests/test_gui_dag_beads.py` — Beads DAG visualization

**Pattern**:
```python
def test_my_integration(live_gui):
    client = ApiHookClient()
    client.push_event("custom_callback", {
        "callback": "_my_method",
        "args": [arg1, arg2],
    })
    time.sleep(0.5)
    assert client.get_value("result") == expected
```

### 3. Mock App Tests (use `mock_app` or `app_instance`, fast)

Need an App instance but not the full render loop. Run in <500ms each.

**Examples**:
- `tests/test_text_viewer.py` — text viewer state updates
- `tests/test_patch_modal.py` — patch modal workflow
- `tests/test_gui2_events.py` — event subscriptions

**Pattern**:
```python
def test_my_thing(mock_app):
    mock_app.some_attr = "test_value"
    mock_app._do_something()
    assert mock_app.some_attr == "expected"
```

### 4. Headless Tests (no GUI, real services)

Test the FastAPI/headless service directly via the Hook API. No subprocess.

**Examples**:
- `tests/test_headless_service.py` — service lifecycle
- `tests/test_headless_verification.py` — full run with QA interceptor

### 5. Opt-in Tests (gated by env var)

Slow or network-dependent tests that don't run by default. Set the env var to enable.

| Test File | Marker | Env Var | Purpose |
|---|---|---|---|
| `tests/test_clean_install.py` | `@pytest.mark.clean_install` | `RUN_CLEAN_INSTALL_TEST=1` | Clones the repo to tmp and verifies the hook API |
| `tests/test_docker_build.py` | `@pytest.mark.docker` | `RUN_DOCKER_TEST=1` | Builds and runs the Docker image |

**Running opt-in tests**:
```bash
RUN_CLEAN_INSTALL_TEST=1 uv run pytest tests/test_clean_install.py -v
RUN_DOCKER_TEST=1 uv run pytest tests/test_docker_build.py -v
```

---

## Markers

Defined in `pyproject.toml`:

```toml
[tool.pytest.ini_options]
markers = [
    "integration: marks tests as integration tests (requires live GUI)",
]
```

**Adding a new marker**: add it to the list. Pytest will warn if a marker is used but not registered.

**Filtering by marker**:
```bash
uv run pytest -m integration         # Only integration tests
uv run pytest -m "not integration"   # Skip integration tests
uv run pytest -m clean_install      # Opt-in clean install tests
```

---

## The Hook API (For Integration Tests)

The live GUI exposes a Hook API on `http://127.0.0.1:8999` when launched with `--enable-test-hooks`. The `ApiHookClient` (`src/api_hook_client.py`) is the Python wrapper.

### Key Methods

```python
client = ApiHookClient()  # connects to localhost:8999 by default

# Click a button
client.click("btn_reset")

# Set a widget value
client.set_value("ui_ai_input", "Hello world")

# Push a generic GUI task
client.push_event("custom_callback", {
    "callback": "_my_method",
    "args": [arg1, arg2],
})

# Get a value (gettable field)
value = client.get_value("show_command_palette")

# Wait for an event
event = client.wait_for_event("ai_response", timeout=10)

# Reset the session
client.reset_session()
```

### `predefined_callbacks` Pattern

To make a test invoke an App method via the hook, register it in `gui_2.py`:

```python
self.controller._predefined_callbacks['_my_method'] = self._my_method
self.controller._gettable_fields['show_thing'] = 'show_thing'
```

The test can then invoke `_my_method` via:
```python
client.push_event("custom_callback", {
    "callback": "_my_method",
    "args": [],
})
```

This pattern is how the Command Palette's `_toggle_command_palette` is exposed for tests (since the keyboard shortcut can't be simulated via the hook).
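
The two registries behave like plain lookup tables, which the following toy model illustrates. `AppSketch`, `handle_custom_callback`, and the dispatch details are hypothetical; only the registry names `_predefined_callbacks` and `_gettable_fields` come from this guide.

```python
from typing import Any, Callable, Dict


class AppSketch:
    def __init__(self) -> None:
        self.show_thing = False
        # callback name -> bound method the hook is allowed to invoke
        self._predefined_callbacks: Dict[str, Callable[..., Any]] = {
            "_my_method": self._my_method,
        }
        # public field name -> attribute name the hook is allowed to read
        self._gettable_fields: Dict[str, str] = {"show_thing": "show_thing"}

    def _my_method(self) -> None:
        self.show_thing = True

    def handle_custom_callback(self, payload: Dict[str, Any]) -> bool:
        """Run a registered callback; return False if the name is unknown."""
        callback = self._predefined_callbacks.get(payload["callback"])
        if callback is None:
            return False
        callback(*payload.get("args", []))
        return True

    def get_value(self, field: str) -> Any:
        """Return a registered field's value, or None if it is not gettable."""
        attr = self._gettable_fields.get(field)
        return getattr(self, attr) if attr is not None else None
```

In this model an unregistered callback is ignored and an unregistered field reads as `None`, which matches the symptoms listed under Common Failure Modes.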

---

## Common Patterns

### Testing a Pure Function

```python
def test_my_function():
    from src.mymodule import my_function
    result = my_function("input", 42)
    assert result == "expected"
```

### Testing with a Mock App

```python
from unittest.mock import MagicMock

def test_with_mock():
    app = MagicMock()
    app.some_attr = "test"
    from src.mymodule import do_thing
    do_thing(app)
    app.some_method.assert_called_once()
```

### Testing via live_gui

```python
import time
import pytest
from src.api_hook_client import ApiHookClient

def test_via_gui(live_gui):
    client = ApiHookClient()
    client.push_event("custom_callback", {
        "callback": "_some_method",
        "args": ["value"],
    })
    time.sleep(0.5)
    assert client.get_value("result") == "expected"
```

### Testing an Exception Path

```python
import pytest

def test_raises():
    from src.mymodule import do_thing
    with pytest.raises(ValueError, match="expected message"):
        do_thing(bad_input)
```

### Parametrized Tests

```python
import pytest

@pytest.mark.parametrize("input,expected", [
    ("a", 1),
    ("b", 2),
    ("c", 3),
])
def test_my_parametrized(input, expected):
    assert my_function(input) == expected
```

---

## Test Configuration

### `pyproject.toml`

```toml
[tool.pytest.ini_options]
asyncio_mode = "strict"
markers = [
    "integration: marks tests as integration tests (requires live GUI)",
]
# asyncio_default_fixture_loop_scope is left unset (TOML has no None value)
asyncio_default_test_loop_scope = "function"
```

`asyncio_mode = "strict"` means async tests need explicit `@pytest.mark.asyncio`. This is intentional — most Manual Slop tests are synchronous.

### Coverage

Run with coverage:
```bash
uv run pytest tests/ --cov=src --cov-report=html
```

Open `htmlcov/index.html` in a browser. Target: >80% coverage for new code (per the project's quality gates).

---

## Running Tests

### All Tests

```bash
uv run pytest tests/ -v
```

**Warning**: This runs all 251 test files, including the slow `live_gui` integration tests. Total runtime: 5-10 minutes.

### Specific Test File

```bash
uv run pytest tests/test_command_palette.py -v
```

### Specific Test

```bash
uv run pytest tests/test_command_palette.py::test_fuzzy_match_prefix_ranks_first -v
```

### Batched Run (Categorized)

```bash
uv run python scripts/run_tests_batched.py
```

This runs the new categorized batcher: 6 fixture-class-isolated tiers (opt-in skipped by default, unit with xdist, mock_app, live_gui in one session, headless, performance). Each tier prints a summary line. Use `--plan` to see the batch plan without running; `--audit` to list unclassified files; `--tiers 1,2` to limit which tiers run.

See `conductor/tracks/test_batching_refactor_20260606/spec.md` for the full design.

### By Marker

```bash
uv run pytest -m integration -v      # Only integration tests
uv run pytest -m "not integration"   # Skip integration tests
```

### With Stop on First Failure

```bash
uv run pytest tests/ -x -v
```

### With Timeout

```bash
uv run pytest tests/ --timeout=60 -v
```

`--timeout` is provided by the `pytest-timeout` plugin, so the plugin must be installed for pytest to recognize this flag.

---

## Adding a New Test

### For a Pure Function

1. Add tests to an existing `tests/test_<module>.py` file (if it exists) or create a new one
2. Use `def test_<thing>():` naming convention
3. No fixtures needed unless you're reading state
4. Verify it runs: `uv run pytest tests/test_<file>.py::test_<name> -v`

### For an Integration Test

1. Create or extend a `*_sim.py` file
2. Add `def test_<thing>(live_gui):` with the live_gui fixture
3. Use `ApiHookClient` to drive the GUI
4. If you need to invoke an App method that's not yet exposed, register it as a `_predefined_callbacks` entry in `gui_2.py`
5. Verify: `uv run pytest tests/test_<file>_sim.py::test_<name> -v`

### For an Opt-in Test (Clean Install / Docker)

1. Mark with `@pytest.mark.<marker_name>`
2. Gate the entire file with a skip if the env var isn't set
3. Add the marker to `pyproject.toml`'s `markers` list
4. Document the env var in the test file's docstring

---

## Debugging Failed Tests

### Verbose Output

```bash
uv run pytest tests/test_X.py -v -s
```

`-s` disables stdout/stderr capture so you can see print() output.

### Stop at First Failure

```bash
uv run pytest tests/test_X.py -x
```

### Enter PDB on Failure

```bash
uv run pytest tests/test_X.py --pdb
```

### Show Local Variables on Failure

```bash
uv run pytest tests/test_X.py -l
```

### Re-run Last Failed

```bash
uv run pytest --lf
```

### Common Failure Modes

| Symptom | Likely Cause | Fix |
|---|---|---|
| `ImportError` for a module | Missing dependency or 1-space indent issue | Check pyproject.toml; run `uv sync` |
| `live_gui` times out | Previous test left a process running | `taskkill /F /IM python.exe` to clean up (this kills every `python.exe` process on the machine, including unrelated ones) |
| `get_value` returns `None` | Field not registered as gettable | Add to `self.controller._gettable_fields` in `gui_2.py` |
| `custom_callback` does nothing | Callback not registered | Add to `self.controller._predefined_callbacks` |
| `IM_ASSERT: Must call EndChild()` | Modal end_child/end pairing broken (usually from a buggy action) | Wrap actions in try/except; check for `imgui.end_child()` before `imgui.end()` |
| `pytest.fail` from `live_gui` startup | Hook server didn't start in 15s | Check `logs/gui_2_py_test.log` for crash |

---

## Audit Scripts

The project has 4 audit scripts that enforce static conventions. They run as pre-commit/CI gates and exit non-zero on regression.

| Script | Enforces | Run command |
|---|---|---|
| `scripts/check_test_toml_paths.py` | No real-TOML references in `tests/` (must be sandboxed) | `python scripts/check_test_toml_paths.py` |
| `scripts/audit_main_thread_imports.py` | Main-thread-purity invariant: the main thread (entering `immapp.run()`) never imports a module heavier than `imgui_bundle` + lean `gui_2` skeleton. Heavy SDKs (`google.genai`, `anthropic`, `openai`, `fastapi`) are lazy-only. | `python scripts/audit_main_thread_imports.py` |
| `scripts/audit_weak_types.py` | Type-alias convention: 430 weak `dict[str, Any]` / `list[dict[...]]` / `Tuple[...]` types across 6 high-traffic files were replaced with named `TypeAlias`es in `src/type_aliases.py`. The audit enforces the convention going forward. | `python scripts/audit_weak_types.py` |
| `scripts/audit_no_models_config_io.py` | `AppController` is the single source of truth for config I/O; direct calls to `models.save_config` / `models.load_config` in `src/` are forbidden. | `python scripts/audit_no_models_config_io.py` |

**Per-script details:**

**`check_test_toml_paths.py`** greps `tests/` for direct `./<name>.toml` references and exits 0 only if all tests are sandboxed. It's the enforcement mechanism for the "no real TOML in tests" rule. If violations are found, migrate the offender to use `tmp_path` + `monkeypatch`.

**`audit_main_thread_imports.py`** (added in `startup_speedup_20260606`) enforces the main-thread-purity invariant. It scans `src/` for `import` statements that pull in heavy SDKs at module level. Each violation is a site that will re-introduce a multi-second startup cost. The startup_speedup track reduced violations from 67 to 62 (5 fixed; the remaining 62 are large refactors tracked in future work).

**`audit_weak_types.py`** (added in `data_structure_strengthening_20260606`) enforces the type-alias convention. The baseline file `scripts/audit_weak_types.baseline.json` records the post-refactor weak-type count (~60 after 86% reduction from 430 in 6 high-traffic files). New violations exit 1; the baseline lets you audit the delta.

**`audit_no_models_config_io.py`** enforces the "AppController is the sole config owner" rule from `conductor/code_styleguides/config_state_owner.md`.

---

## Test Data Flow

A typical test goes through this lifecycle:

```
Test starts
  ├─> isolate_workspace (autouse)
  │     ├─> Creates tmp dir
  │     └─> Sets SLOP_* env vars
  │
  ├─> reset_paths (autouse)
  │     └─> paths.reset_resolved()
  │
  ├─> reset_ai_client (autouse)
  │     └─> Resets ai_client global state
  │
  ├─> (test body runs)
  │     ├─> If using live_gui: subprocess already running (session-scoped)
  │     ├─> Test makes API calls via ApiHookClient
  │     └─> Test asserts on returned values
  │
  └─> Teardown
        ├─> reset_paths runs again
        └─> (autouse) state cleanup
```

The `live_gui` session fixture runs once at the start of the test session and tears down once at the end. All tests in the session share the same `sloppy.py` process.

---

## Known Gotchas (2026-06-05)

### Authoring Robust `live_gui` Tests (Don't Assume Clean State)

`live_gui` is a **session-scoped** fixture. All tests in a session share the same `sloppy.py` subprocess. The subprocess is **not** restarted between tests; its internal state (Fonts, DisplaySize, internal caches, current theme, current workspace profile, current discussion, current MMA track) **accumulates** from the previous test.

**This is a test-authoring contract, not a fixture bug.** A test that "passes when run after test X" but "fails when run in isolation" is a fragile test. Robust `live_gui` tests must:

1. **Not assume clean state.** Before invoking an operation, explicitly verify the precondition via the Hook API (e.g. `client.get_value("show_my_window")`, `client.get_mma_status()`, `client.get_session()`). Do not assume a previous test set the state.
2. **Use the wait-for-ready pattern, not fixed sleeps.** `time.sleep(1)` is **not** long enough for ImGui to stabilize during the first few render frames. If you must sleep, use 3+ seconds; better, use `wait_for_event` with a generous timeout, or poll `client.get_status()` until ImGui reports `ready`. Fixed sleeps are a code smell: if you reach for one, the right answer is almost always to poll a gettable field instead.
3. **Reset state explicitly if the test depends on it.** For tests that mutate state (e.g. "click button X"), reset the relevant state via Hook API in a `try/finally` so the next test starts from a known baseline. Alternatively, use a function-scoped helper that issues a `reset_session` callback before the test body.
4. **Test both in the full suite AND in isolation before merging.** If a test passes in the full suite but fails in isolation, the test is fragile — fix the test, don't add a "warmup" comment. Bisecting by `pytest path::test -k "filter"` or `pytest --collect-only --quiet` helps.
5. **Use `get_value`/`wait_for_event` to assert ready, not just to assert success.** Example:
   ```python
   def test_open_settings_modal(live_gui):
       client = ApiHookClient()
       client.push_event("custom_callback", {"callback": "_toggle_settings", "args": []})
       # Wait for the modal to actually appear, not just for the click to dispatch
       assert client.get_value("show_settings_modal"), "settings modal did not open"
   ```
   The `get_value` poll doubles as a wait-for-ready AND a correctness assertion.

**Anti-pattern (fragile):**
```python
def test_open_settings_modal(live_gui):
    client = ApiHookClient()
    client.push_event("custom_callback", {"callback": "_toggle_settings", "args": []})
    time.sleep(1)  # hope the modal opened
    assert some_cached_value["settings_open"] is True  # may be stale from a prior test
```

**Pattern (robust):**
```python
def test_open_settings_modal(live_gui):
    client = ApiHookClient()
    client.reset_session()  # function-scoped helper; Hook API reset callback
    client.push_event("custom_callback", {"callback": "_toggle_settings", "args": []})
    assert client.get_value("show_settings_modal"), "settings modal did not open"
```
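
Where a single `get_value` call may race the render loop, the read can be wrapped in a bounded poll. This is a minimal sketch: `wait_for_value` is a hypothetical helper, not part of `ApiHookClient`, and the default timeout and interval are arbitrary.

```python
import time
from typing import Any, Callable


def wait_for_value(
    get_value: Callable[[str], Any],
    field: str,
    expected: Any,
    timeout: float = 5.0,
    interval: float = 0.05,
) -> bool:
    """Poll `get_value(field)` until it equals `expected` or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if get_value(field) == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In a test this would read `assert wait_for_value(client.get_value, "show_settings_modal", True), "settings modal did not open"`. It returns as soon as the state flips and still fails within a bounded time if it never does.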

### Early-Render C-Level Crashes (Defer-Not-Catch Pattern)

`imgui.save_ini_settings_to_memory()` (and similar raw imgui calls that read internal state) will **crash the Python process at the C level** (`0xc0000005` access violation) if called before ImGui's internal state is fully initialized. This is **not catchable from Python** — `try/except Exception` cannot intercept native access violations.

Symptoms:
- The `sloppy.py` subprocess disappears without a Python traceback.
- The pytest run fails with `Hook server did not start in 15s` (raised via `pytest.fail` because the subprocess died during startup).
- Windows Event Viewer shows `Faulting module: _imgui_bundle.cp311-win_amd64.pyd` with exception code `0xc0000005`.
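
When the subprocess vanishes without a traceback, its return code is often the quickest confirmation. The helper below is a hypothetical diagnostic sketch (the name `describe_exit` is an assumption, and nothing like it is claimed to exist in `tests/conftest.py`). It relies only on the fact that Windows reports an access violation as exit status `0xC0000005`, which some tools display as a signed 32-bit number.

```python
ACCESS_VIOLATION = 0xC0000005  # Windows STATUS_ACCESS_VIOLATION


def describe_exit(returncode):
    """Classify a subprocess return code (hypothetical helper for fixture diagnostics)."""
    if returncode is None:
        return "still running"
    if returncode == 0:
        return "clean exit"
    # Normalize to unsigned 32-bit so signed and unsigned renderings compare equal.
    if (returncode & 0xFFFFFFFF) == ACCESS_VIOLATION:
        return "native access violation (0xc0000005): likely a C-level crash"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative return code.
        return f"killed by signal {-returncode}"
    return f"exited with code {returncode}"
```

A fixture could call this with `proc.poll()` before failing on the startup timeout, so the failure message distinguishes a native crash from a slow start.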

**Fix pattern: defer-not-catch.** Track a one-shot "ready" flag in the instance state; return early on the first call, only invoking the C function on subsequent calls:

```python
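def _capture_workspace_profile(self, name: str) -> models.WorkspaceProfile:
    # First call happens during startup, before ImGui state is initialized:
    # flip the one-shot flag and return a safe empty profile without touching C code.
    if not getattr(self, "_ini_capture_ready", False):
        self._ini_capture_ready = True
        # Sentinel is "" (str), not b"": see the sentinel type contract below.
        return models.WorkspaceProfile(name=name, docking_layout="", ...)
    # Later calls (e.g. "Save Profile") run after rendering has started.
    ini = imgui.save_ini_settings_to_memory()
    return models.WorkspaceProfile(name=name, docking_layout=ini.decode("utf-8") if isinstance(ini, bytes) else ini, ...)
```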

The first call (during initial startup) returns a safe empty profile and flips the flag; subsequent calls (when the user actually clicks "Save Profile") invoke the C function. The user's workflow is unaffected because the first call is non-blocking and the user cannot have clicked "Save Profile" before the GUI was fully rendered.

See `src/gui_2.py:601-606` for the canonical implementation. This pattern unblocks 4-5 live_gui tests that were crashing the GUI subprocess during the first render frames after `_capture_workspace_profile` was invoked by the test (typically via a `save_workspace_profile` Hook API callback).

**Sentinel type contract.** When implementing a defer-not-catch guard, the early-return sentinel must match the type expected by downstream consumers. For `WorkspaceProfile.ini_content: str` (in this codebase), the sentinel must be `""` (str), not `b""` (bytes): `tomli_w` rejects bytes (`TypeError: Object of type 'bytes' is not TOML serializable`), and `imgui.load_ini_settings_from_memory(ini_data: str, ...)` also expects `str`. A previous version of this fix used `b""`, which broke the save flow with a `TypeError` raised by `tomli_w.dump`; the unit tests still passed, but the live_gui save+load round-trip failed. The fix was a 1-character change (`b""` → `""`). The regression test in `tests/test_workspace_profile_serialization.py` encodes this contract.
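
A contract check of this kind can be sketched without ImGui at all. The example below is an assumption-laden illustration, not the contents of the regression test named above: `WorkspaceProfile` is a hypothetical two-field model, `ProfileCapture` is an invented wrapper, and `fake_save_ini` stands in for `imgui.save_ini_settings_to_memory`.

```python
from dataclasses import dataclass


@dataclass
class WorkspaceProfile:
    """Hypothetical minimal model; the real one lives in the project's models module."""
    name: str
    ini_content: str


class ProfileCapture:
    """Defer-not-catch guard around a call that is unsafe before initialization."""

    def __init__(self, save_ini):
        self._save_ini = save_ini  # stand-in for the native ImGui call
        self._ready = False

    def capture(self, name):
        if not self._ready:
            # First call: skip the native function and return a str sentinel.
            self._ready = True
            return WorkspaceProfile(name=name, ini_content="")
        ini = self._save_ini()
        # Normalize to str so serializer and loader both receive the expected type.
        if isinstance(ini, bytes):
            ini = ini.decode("utf-8")
        return WorkspaceProfile(name=name, ini_content=ini)


def check_sentinel_contract():
    calls = []

    def fake_save_ini():
        calls.append(1)
        return "[Window][Debug]\nPos=60,60\n"

    capture = ProfileCapture(fake_save_ini)
    first = capture.capture("default")   # deferred: native call is skipped
    second = capture.capture("default")  # ready: native call runs once
    return first, second, len(calls)
```

Asserting `isinstance(profile.ini_content, str)` on both the sentinel and the real result is what catches a `b""` regression before it reaches the serializer.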

---

## Pattern: Narrow Test Paths vs. Kitchen-Sink Functions

**Anti-pattern: calling a kitchen-sink function.** A test that does `gui_2.render_main_interface(app_instance)` requires mocking 50+ imgui/imscope methods because `render_main_interface` dispatches to dozens of nested render functions. Adding a single mock for `imscope.window` (to return a tuple) just reveals the next un-mocked dependency (e.g. `imgui.begin` returning bool where a 2-tuple is expected). The test never reaches its assertion.

**Better pattern: test the narrow function.** Most render flows have a dedicated sub-function (e.g. `render_prior_session_view`, `render_preset_manager_window`, `render_theme_panel`). Refactor the test to call the narrow function directly with mocks scoped to what *that* function actually uses. Example outcome:

- `render_main_interface` test: 50+ mocks, ~6s runtime, flakiness on every un-mocked imgui call.
- `render_prior_session_view` test: 20 mocks, ~0.08s runtime, stable.

**When to refactor vs. add mocks:**
- If the test intent is "verify push/pop balance in the prior-session render path", call the narrow function.
- If the test intent is "verify the whole GUI render path is correct", accept the 50+ mock cost (and ensure all mocks are correct).

See the `prior_session_test_harden_20260605` plan in `docs/superpowers/plans/` for the concrete refactor example.
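
As a rough illustration of the narrow-path idea (separate from the plan referenced above), the sketch below checks push/pop balance on a small render function using a single `MagicMock`. `render_prior_session_panel` is a hypothetical function with an assumed signature; the real `render_prior_session_view` may take different arguments and call different imgui methods.

```python
from unittest.mock import MagicMock


def render_prior_session_panel(imgui, entries):
    """Hypothetical narrow render function; the real signature may differ."""
    imgui.push_id("prior_session")
    try:
        for entry in entries:
            imgui.text(entry)
    finally:
        imgui.pop_id()  # must run even if drawing an entry raises


def check_push_pop_balance():
    imgui = MagicMock()  # one mock scoped to what this function actually calls
    render_prior_session_panel(imgui, ["turn 1", "turn 2"])
    return imgui.push_id.call_count, imgui.pop_id.call_count, imgui.text.call_count


def check_balance_on_error():
    imgui = MagicMock()
    imgui.text.side_effect = RuntimeError("draw failed")
    try:
        render_prior_session_panel(imgui, ["turn 1"])
    except RuntimeError:
        pass
    return imgui.push_id.call_count == imgui.pop_id.call_count
```

Because the function under test touches only three imgui calls, the test states its intent (balanced push/pop) directly instead of burying it under unrelated mocks.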

---

## Pattern: Indentation-Driven Method Visibility

**The bug:** A method intended to live on the class (2-space indent) may be parsed as nested inside the preceding method if its indentation is off by even one space. The file is still syntactically valid (it imports without error), but the method is **not** on the class: `hasattr(App, 'method_name')` returns `False`. Any production code that calls `app.method_name` falls through to `__getattr__`, which delegates to the controller (which also lacks the method), and a cryptic `AttributeError` is raised at runtime.

**How to detect:**
- Use AST to list all App methods: `uv run python -c "import ast; tree = ast.parse(open('src/gui_2.py').read()); [print(item.name) for n in ast.walk(tree) if isinstance(n, ast.ClassDef) and n.name == 'App' for item in n.body if isinstance(item, ast.FunctionDef)]"`.
- The skeleton via `manual-slop_py_get_skeleton` should show the method as a class member.

**How to fix:** Re-indent the affected method to 2-space class level. Run the failing test to confirm. See the `live_gui_test_hardening_v2_20260605` track in `conductor/tracks.md` for the concrete example (where `_capture_workspace_profile` was being parsed as nested inside `_apply_snapshot` due to a 1-space indentation drift after a cleanup commit).
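
The AST check can also be written as a small reusable function, which is easier to drop into a regression test than the one-liner. This is a sketch under assumptions: `class_method_names` is an illustrative helper name, and the two source strings are made-up samples that reproduce the shape of the bug (a `def` sitting at the body indentation of the previous method), not excerpts from `src/gui_2.py`.

```python
import ast


def class_method_names(source, class_name):
    """Return names of functions defined directly in the body of class_name."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return [
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
    return []


# The second def sits at the body indentation of the first, so it is nested.
DRIFTED = (
    "class App:\n"
    "  def _apply_snapshot(self):\n"
    "    self.applied = True\n"
    "    def _capture_workspace_profile(self, name):\n"
    "      return name\n"
)

# Re-indented to class level: both defs are class members.
FIXED = (
    "class App:\n"
    "  def _apply_snapshot(self):\n"
    "    self.applied = True\n"
    "  def _capture_workspace_profile(self, name):\n"
    "    return name\n"
)
```

A test can then assert that an expected method name appears in `class_method_names(source, "App")`, which fails immediately when indentation drift hides the method.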

---

## See Also

- **[guide_simulations.md](guide_simulations.md)** — Older guide focused on the Puppeteer pattern; still relevant for the test scenarios it documents
- **[guide_meta_boundary.md](guide_meta_boundary.md)** — Application vs Meta-Tooling domain separation; the test suite is in the Application domain
- **[guide_architecture.md](guide_architecture.md#the-task-pipeline-producer-consumer-synchronization)** — Threading model that the `live_gui` test fixture respects
- **`src/api_hook_client.py`** — The Python wrapper for the Hook API used in integration tests
- **`tests/conftest.py`** — The canonical source of all fixtures documented in this guide