ed 5fa8a10ebf docs(testing): critical live_gui_workspace path fix + 8 new sections

CRITICAL fix:
- live_gui_workspace path: tmp_path_factory (banned) ->
  tests/artifacts/live_gui_workspace_<timestamp> (per-run timestamp)
  (per conductor/code_styleguides/workspace_paths.md)

8 new sections under 'Per-test Subprocess Resilience':
1. _reset_clean_baseline autouse fixture (mma_tier_usage +
   rag_config=default RAGConfig(), not None)
2. Watchdog and Hang Bounding (signal-based, 900s smart + 900s
   unconditional, replaces removed 30s daemon-thread)
3. Chroma Cache Path (tests/artifacts/.slop_cache/, parent-trailing-slash
   bug, pre-cleanup pattern in test_rag_phase4_final_verify)
4. xdist Worker Coordination (O_EXCL file lock, PYTEST_XDIST_WORKER,
   owner/client roles, stale lock demotion)
5. Required Test Dependencies Gate (sentence-transformers,
   uv sync --extra local-rag fix)
6. MMA and RAG State in reset_session() (5 buckets: mma_tier_usage
   pre-populated, rag_config fresh RAGConfig() not None)
7. _LiveGuiHandle __getitem__ (handle[0] / handle[1])

Expand 'Audit Script' -> 'Audit Scripts' (4 scripts total):
- check_test_toml_paths.py (existing)
- audit_main_thread_imports.py (startup_speedup)
- audit_weak_types.py (data_structure_strengthening)
- audit_no_models_config_io.py (config_state_owner styleguide)

2026-06-10 20:05:16 -04:00

Testing Guide

Overview

Manual Slop has 251 test files in tests/ covering every subsystem. The test infrastructure is designed around four principles:

No real I/O during tests — every test gets a sandboxed workspace via the isolate_workspace autouse fixture.
No real AI calls — tests use mock providers, reset session state, and never hit the network.
GUI tests launch a real app — the live_gui session fixture starts sloppy.py --enable-test-hooks so integration tests can drive the actual app via the Hook API.
Tests are categorized by marker — unit, integration, strict, clean_install, docker — so CI can opt in to expensive tests.

This guide is the canonical reference for how the test suite is structured and how to add new tests.

Test File Layout

tests/
├── conftest.py                    # Session-wide fixtures (live_gui, isolate_workspace, etc.); this file is the canonical source
├── test_*.py                      # 251 test files, named `test_<topic>_<aspect>.py`
├── *_sim.py                      # Integration tests using the live_gui fixture
├── test_clean_install.py          # Opt-in: clones the repo to tmp and verifies hooks
├── test_docker_build.py          # Opt-in: builds and runs the Docker image
├── test_arch_boundary_phase1.py   # Architectural boundary tests
├── test_enforce_no_real_toml.py   # Meta-test for the enforcer fixture
├── artifacts/                    # Git-ignored; test output
├── logs/                          # Git-ignored; live_gui log files
└── mock_concurrent_mma.py         # Mock providers for MMA tests

Naming conventions:

test_*.py — pytest collection
*_sim.py — integration test (uses live_gui)
*_e2e.py — end-to-end test (real processes, opt-in via env var)
test_<area>_<aspect>.py — single aspect of an area, e.g., test_ai_client_cli.py

The `conftest.py` Fixtures

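The tests/conftest.py file defines the 7 fixtures described in this section, plus the kill_process_tree helper function. They are grouped below by kind: autouse first, then opt-in function-scoped, then session-scoped. This is a reading order only; at runtime pytest sets up higher-scoped fixtures (such as the session-scoped live_gui) before lower-scoped ones.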

Autouse Fixtures (Run Before Every Test)

`isolate_workspace` (line 70)

Purpose: Give every test a fresh, isolated workspace so it cannot pollute the user's real manual_slop.toml, presets.toml, etc.

Mechanism:

Creates a temp directory via tmp_path_factory.mktemp("isolated_workspace")
Writes a fresh config.toml to the temp dir
Sets SLOP_CONFIG, SLOP_GLOBAL_PRESETS, SLOP_GLOBAL_TOOL_PRESETS, SLOP_GLOBAL_PERSONAS, SLOP_GLOBAL_WORKSPACE_PROFILES env vars to point at the temp dir
The app reads these env vars on startup; the test sees an isolated world

Verification: python scripts/check_test_toml_paths.py exits 0 (no test references real TOMLs).

`reset_paths` (line 95)

Purpose: Reset the src.paths global state before and after each test.

Mechanism: Calls paths.reset_resolved() so path resolution re-evaluates on the next access.

`reset_ai_client` (line 107)

Purpose: Prevent ai_client state from leaking between tests.

Mechanism:

Calls ai_client.reset_session()
Clears callback hooks (confirm_and_run_callback, comms_log_callback, tool_log_callback)
Clears all event listeners
Resets provider to ("gemini", "gemini-2.5-flash-lite")
Resets MCP client state via mcp_client.configure([], [])

Function-Scoped Fixtures (Opt-in)

`vlogger` (line 131)

Purpose: Provide a VerificationLogger instance for structured diagnostic logging.

Usage:

def test_my_thing(vlogger):
    vlogger.log_state("Field", "before_value", "after_value")
    # ... test logic ...
    vlogger.finalize("Test Title", "PASS", "result message")

Output: tests/logs/<timestamp>/<script_name>.txt

`kill_process_tree` (function, line 138)

Purpose: Robustly kill a process and all its children. Used by live_gui for cleanup, but available to any test.

Mechanism:

Windows: taskkill /F /T /PID <pid> (the /T flag is critical — kills the whole tree)
Unix: os.killpg(os.getpgid(pid), SIGKILL) (kills the process group)
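A minimal sketch of the two branches above, assuming the signature kill_process_tree(pid); the real helper in tests/conftest.py may handle more cases. Note that the Unix branch kills the whole process group, so the target should have been started in its own group (for example with start_new_session=True), otherwise the caller's group is killed as well.

```python
import os
import signal
import subprocess
import sys


def kill_process_tree(pid: int) -> None:
    """Best-effort kill of a process and everything it spawned."""
    if sys.platform == "win32":
        # /T kills the whole tree, /F forces termination.
        subprocess.run(
            ["taskkill", "/F", "/T", "/PID", str(pid)],
            capture_output=True,
            check=False,
        )
        return
    try:
        # Kill the process group that the target belongs to.
        os.killpg(os.getpgid(pid), signal.SIGKILL)
    except ProcessLookupError:
        pass  # the process is already gone
```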

`mock_app` (line 157)

Purpose: Create an App instance with all external side effects mocked. For unit tests that need the App but not the GUI loop.

Mocks applied:

src.models.load_config → returns a default config
src.gui_2.project_manager
src.gui_2.session_logger
src.gui_2.immapp.run (prevents the actual render loop from starting)
src.app_controller.AppController._load_active_project
src.app_controller.AppController._fetch_models
App._load_fonts
App._post_init
src.app_controller.AppController._prune_old_logs
src.app_controller.AppController.start_services
src.app_controller.AppController._init_ai_and_hooks
src.performance_monitor.PerformanceMonitor

Cleanup: Shuts down the controller after the test.

`app_instance` (line 190)

Purpose: Applies the same mocks as mock_app. It exists as a separate fixture for historical reasons (it was used in test_gui_phase4.py and test_token_viz.py). The two are equivalent for most purposes.

Session-Scoped Fixtures (One Per Test Run)

`live_gui` (line 227)

Purpose: Start sloppy.py --enable-test-hooks for the entire test session. Integration tests use this to drive the real GUI via the Hook API.

Lifecycle:

Setup (once per session):
- Compute the per-run workspace path: tests/artifacts/live_gui_workspace_<timestamp>/ (where <timestamp> is datetime.now().strftime("%Y%m%d_%H%M%S") at conftest import time). Per the conductor/code_styleguides/workspace_paths.md hard rule, test workspaces live in the project tree (not %TEMP%).
- Write manual_slop.toml and config.toml to the workspace
- Set up SLOP_* env vars to point at the workspace
- Symlink assets/ for fonts
- Launch sloppy.py --enable-test-hooks via subprocess.Popen
- Poll GET /status for up to 15 seconds (waiting for the HookServer to start)
- On failure: pytest.fail() (kills the process tree, aborts the session)
Yield: tests run
Teardown (once per session):
- Call ApiHookClient.reset_session() to clear GUI state
- Kill the process tree (Windows: taskkill /F /T, Unix: SIGKILL)
- Wait 0.5s for file handles to close
- Close the log file
- Remove the per-run workspace (with 5 retries for Windows file locks)
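The startup poll in the setup phase can be sketched with the standard library. This is an illustration under assumptions: the function name wait_for_hook_server is hypothetical, and treating an HTTP 200 from /status as "ready" is an assumption about the HookServer.

```python
import time
import urllib.error
import urllib.request


def wait_for_hook_server(
    url: str = "http://127.0.0.1:8999/status",
    timeout: float = 15.0,
    interval: float = 0.25,
) -> bool:
    """Poll the status endpoint until it answers or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=1.0) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval)
    return False
```

A fixture would call this after subprocess.Popen and, on a False result, kill the process tree before calling pytest.fail().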

Yield value: A _LiveGuiHandle object (see below). Most tests just take the fixture and use the ApiHookClient directly.

Usage pattern:

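import time
from src.api_hook_client import ApiHookClient

def test_my_thing(live_gui):
    client = ApiHookClient()  # connects to localhost:8999
    client.click("btn_id")
    time.sleep(0.5)  # fixed sleep kept for brevity; polling a gettable field is more robust
    assert client.get_value("show_thing") is True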

Per-test Subprocess Resilience (2026-06-09)

Added in the test_infrastructure_hardening_20260609 track. The mechanisms below address the "subprocess state pollution" and "controller state pollution" failure modes that caused batch regressions.

`_LiveGuiHandle` class (tests/conftest.py:393)

The live_gui fixture yields a _LiveGuiHandle instead of a (process, gui_script) tuple. The handle exposes:

Attribute/Method	Purpose
`process`	The `subprocess.Popen` for the sloppy.py subprocess
`gui_script`	Absolute path to sloppy.py
`workspace`	Absolute path to the subprocess's working directory (`tests/artifacts/live_gui_workspace_<timestamp>/`)
`is_alive()`	True if the subprocess is running
`ensure_alive()`	Detection-only stub: increments `respawn_count` if the subprocess is dead; does not respawn (deferred)
`respawn_count`	Number of times the subprocess was found dead

Backward compat: The handle is iterable as (process, gui_script), so existing proc, _ = live_gui patterns still work.

`live_gui_workspace` fixture (tests/conftest.py:727)

Yields live_gui.handle.workspace, a Path to tests/artifacts/live_gui_workspace_<timestamp>/ (computed at conftest import time). Tests that need to create files in the workspace should request this fixture instead of hardcoding the path.

Important: This is NOT tmp_path_factory.mktemp() despite the older documentation. The workspace_paths.md styleguide bans tmp_path_factory for test infrastructure workspaces (they'd live in %TEMP% and be unfindable). The path is in the project tree, under tests/artifacts/, and is shared across all xdist workers in a session (with a file-lock for spawn coordination).

def test_rag_setup(live_gui, live_gui_workspace):
    test_file = live_gui_workspace / "my_input.txt"
    test_file.write_text("hello")
    # ... configure RAG, index, query

`_check_live_gui_health` autouse fixture (tests/conftest.py:650)

Runs before every test that uses live_gui. Calls handle.ensure_alive() to detect subprocess death between tests. If the subprocess died, the counter increments (but the subprocess is not respawned — see ensure_alive above).

`clean_baseline` marker

Opt-in marker for tests that need a fresh controller state. Tests marked with @pytest.mark.clean_baseline get /api/reset_session called before they start, ensuring no pollution from prior tests.

@pytest.mark.clean_baseline
def test_rag_final_verify(live_gui):
    # ai_input is guaranteed empty, controller is in a known state
    ...

Use this for tests that are sensitive to controller state pollution from prior tests in the same session. The test_rag_phase4_final_verify test is marked this way because the 4 sims in test_extended_sims.py mutate controller state (provider, model, etc.) that would otherwise pollute the RAG test.

`_reset_clean_baseline` autouse fixture

The clean_baseline marker triggers an autouse _reset_clean_baseline fixture that calls client.reset_session() before the test starts. The _handle_reset_session controller method must clear both MMA state and RAG state to prevent cross-test pollution:

MMA: mma_tier_usage, mma_status, active_tier
RAG: rag_engine (if disabled), rag_config (reset to a fresh RAGConfig(), not None — rag_config=None makes all rag_* setters no-op because they check if self.rag_config:)

See tests/test_reset_session_clears_mma_and_rag.py for the regression test that locked this contract in.

Watchdog and Hang Bounding

The conftest's hang-bounding is signal-based, not naive time.sleep + os._exit:

Watchdog	Timeout	Trigger	Action
Smart watchdog (`_smart_watchdog_exit`)	900s (15 min)	Waits for `_pytest_finished_event`	On signal: 5s grace, then `os._exit(0)`. On timeout: `os._exit(2)` (hard fail).
Unconditional watchdog (`_unconditional_watchdog_exit`)	900s (15 min; same timeout, kept as an independent safety net)	Same event	Same action. Backstop in case the smart watchdog fails to fire.

_pytest_finished_event is set in two hooks for redundancy: pytest_terminal_summary (primary — after the summary is printed) and pytest_unconfigure (fallback). The prior daemon-thread approach (30s os._exit(0)) was removed because it killed batches mid-test on Windows (daemon threads are not auto-killed by the interpreter) and hid pytest's exit code from run_tests_batched.py. See conductor/workflow.md for the broader rationale.
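The smart watchdog's behaviour can be sketched as follows. This is a simplified illustration, not the conftest implementation: the function name and the injectable exit_fn parameter are assumptions (the real watchdog calls os._exit directly and runs on its own thread).

```python
import threading
import time
from typing import Callable


def smart_watchdog_sketch(
    finished: threading.Event,
    exit_fn: Callable[[int], None],
    timeout: float = 900.0,
    grace: float = 5.0,
) -> None:
    """Wait for the 'pytest finished' event, then exit with a meaningful code."""
    if finished.wait(timeout):
        # pytest signalled completion: allow a short grace period, then exit cleanly.
        time.sleep(grace)
        exit_fn(0)
    else:
        # The event never arrived within the timeout: treat it as a hang.
        exit_fn(2)
```

In a conftest this would run as threading.Thread(target=smart_watchdog_sketch, args=(event, os._exit)), with the event set from pytest_terminal_summary and pytest_unconfigure.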

Chroma Cache Path and Cross-Test Pollution

The chroma cache lives at tests/artifacts/.slop_cache/chroma_<project>/, NOT the per-run live_gui_workspace_<timestamp> subdir. This is because active_project_root = Path(active_project_path).parent and some test setups produce a trailing slash on the project path, placing the cache one level higher than expected.
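The path arithmetic behind this can be reproduced with pathlib alone. The sketch below assumes the project path is sometimes a file inside the workspace and sometimes the workspace directory itself with a trailing slash; the directory name live_gui_workspace_TS is a placeholder.

```python
from pathlib import PurePosixPath


def project_root(active_project_path: str) -> PurePosixPath:
    """Mirror of the `Path(active_project_path).parent` expression."""
    return PurePosixPath(active_project_path).parent


# Project path is a file inside the workspace: the root is the workspace.
file_style = project_root("tests/artifacts/live_gui_workspace_TS/manual_slop.toml")

# Project path is the workspace directory with a trailing slash: pathlib drops
# the slash, so .parent climbs one level higher, to tests/artifacts.
dir_style = project_root("tests/artifacts/live_gui_workspace_TS/")

print(file_style)  # tests/artifacts/live_gui_workspace_TS
print(dir_style)   # tests/artifacts
```

In the second case a cache placed at `<root>/.slop_cache/` ends up at tests/artifacts/.slop_cache/, which is shared across runs rather than scoped to one timestamped workspace.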

Implication for tests: Tests that interact with RAG should pre-clean the chroma cache to avoid persistent state from prior tests in the batch. tests/test_rag_phase4_final_verify.py does this:

import shutil
from pathlib import Path

def test_phase4_final_verify(live_gui):
    # Wipe any stale chroma from prior batched runs
    cache = Path("tests/artifacts/.slop_cache/chroma_test_final_verify")
    if cache.exists():
        shutil.rmtree(cache, ignore_errors=True)
    # ... rest of test

Without this cleanup, a prior batched run with a different embedding provider (e.g., Gemini 3072-dim vs local 384-dim) can leave a corrupt collection that fails the next test's search() with a dim-mismatch error. The _validate_collection_dim() mechanism in RAGEngine also auto-recovers (see guide_rag.md) but pre-cleaning is faster and avoids the stderr warning.

xdist Worker Coordination and Stale Lock Demotion

The live_gui fixture uses an O_EXCL-atomic file lock at tests/artifacts/live_gui_workspace_*/.live_gui_owner.lock:

Each xdist worker reads its ID from PYTEST_XDIST_WORKER env var (e.g., gw0, gw1, ...).
The first worker to acquire the file lock becomes the owner and spawns sloppy.py --enable-test-hooks.
Other workers become clients and wait for the hook server to respond on 127.0.0.1:8999, then yield a _LiveGuiHandle(process=None, ...) (null process — they share the owner's subprocess via the hook API).
If the owner's hook server is not up when a client finds the lock already held (e.g., the owner crashed or has not started), the client treats the lock as stale: it removes the lock and retries as the owner. This prevents one crashed owner from blocking all 16 workers.
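The owner/client decision can be sketched with an O_EXCL file create, which fails atomically if the file already exists. The function names and the hook_server_up callable are hypothetical; the real fixture also has to wait for a slow-starting owner before declaring the lock stale, which this sketch omits.

```python
import os
from pathlib import Path
from typing import Callable


def try_acquire_owner_lock(lock_path: Path, worker_id: str) -> bool:
    """Atomically create the lock file; return False if another worker holds it."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(worker_id)  # record who owns the subprocess
    return True


def claim_role(lock_path: Path, hook_server_up: Callable[[], bool]) -> str:
    """Return 'owner' or 'client' for the current xdist worker."""
    # "master" is used here as the fallback ID when not running under xdist.
    worker_id = os.environ.get("PYTEST_XDIST_WORKER", "master")
    if try_acquire_owner_lock(lock_path, worker_id):
        return "owner"
    if hook_server_up():
        return "client"
    # Lock exists but nobody is serving: treat it as stale and retry once.
    lock_path.unlink(missing_ok=True)
    return "owner" if try_acquire_owner_lock(lock_path, worker_id) else "client"
```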

Required Test Dependencies Gate

_check_required_test_dependencies() runs in pytest_configure and raises pytest.UsageError at session start if a required test dep is missing:

Module	Package	Fix command
`sentence_transformers`	`sentence-transformers`	`uv sync --extra local-rag`

This is a regression gate for the 2026-06-09 incident where a fresh uv sync (without --extra local-rag) produced a confusing rag_status = error: ... Install with manual_slop[local-rag] failure inside the live_gui subprocess. The gate fails fast with the exact fix command instead. To add a new required test dep, append (module_name, package_name) to _REQUIRED_TEST_IMPORTS in tests/conftest.py.
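A gate of this kind can be sketched with importlib.util.find_spec, which checks for a top-level module without importing it. The helper names find_missing and format_missing are hypothetical; only _REQUIRED_TEST_IMPORTS and its (module_name, package_name) shape come from the text above.

```python
import importlib.util

_REQUIRED_TEST_IMPORTS = [
    ("sentence_transformers", "sentence-transformers"),
]


def find_missing(required: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the (module, package) pairs whose module cannot be found."""
    return [
        (module, package)
        for module, package in required
        if importlib.util.find_spec(module) is None
    ]


def format_missing(missing: list[tuple[str, str]], fix: str) -> str:
    """Build a single actionable error message."""
    names = ", ".join(package for _, package in missing)
    return f"Missing required test dependencies: {names}. Fix: {fix}"
```

In pytest_configure, a non-empty result from find_missing(_REQUIRED_TEST_IMPORTS) would be turned into raise pytest.UsageError(format_missing(missing, "uv sync --extra local-rag")).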

MMA and RAG State in `reset_session()`

controller._handle_reset_session() must clear 5 specific state buckets to prevent cross-test pollution:

Bucket	What gets reset	Why
`mma_tier_usage`	Pre-populated to full default shape (input, output, provider, model, tool_preset) — NOT empty dict	Downstream `_flush_to_project` does `d["model"]`; empty dict causes `KeyError`.
`mma_status`	`None`	Status machine reset.
`active_tier`	`None`	Current tier cleared.
`rag_engine`	`None` if RAG disabled; unchanged if enabled (teardown is expensive)	Avoids 30s+ chroma re-init per test.
`rag_config`	Fresh `RAGConfig()` default (not `None`)	All `rag_*` setters check `if self.rag_config:` and become no-ops if None.

The regression tests are in tests/test_reset_session_clears_mma_and_rag.py (poll-for-state pattern, not time.sleep).
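The contract in the table can be illustrated with stand-in classes. Everything here is hypothetical scaffolding: RAGConfig is a two-field placeholder, the default values in default_tier_usage are guesses at the "full default shape", and ControllerSketch is not the real controller.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RAGConfig:  # placeholder for the real config class
    enabled: bool = False
    top_k: int = 5


def default_tier_usage() -> dict:
    # Keys come from the table above; the default values are assumptions.
    return {"input": 0, "output": 0, "provider": "", "model": "", "tool_preset": ""}


class ControllerSketch:
    def __init__(self) -> None:
        self.mma_tier_usage: dict = default_tier_usage()
        self.mma_status: Optional[str] = None
        self.active_tier: Optional[str] = None
        self.rag_engine: Optional[Any] = None
        self.rag_config: Optional[RAGConfig] = RAGConfig()

    def set_rag_top_k(self, top_k: int) -> None:
        if self.rag_config:  # silently a no-op when rag_config is None
            self.rag_config.top_k = top_k

    def reset_session(self) -> None:
        rag_enabled = bool(self.rag_config and self.rag_config.enabled)
        self.mma_tier_usage = default_tier_usage()  # full shape, never {}
        self.mma_status = None
        self.active_tier = None
        if not rag_enabled:
            self.rag_engine = None  # keep an enabled engine; re-init is slow
        self.rag_config = RAGConfig()  # fresh default, never None
```

Resetting rag_config to None instead would leave every later set_rag_* call silently ineffective, which is the failure mode the regression test guards against.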

`_LiveGuiHandle` Indexing (`__getitem__`)

In addition to the __iter__ tuple-unpacking backward compat, _LiveGuiHandle supports __getitem__ indexing:

Expression	Returns
`handle[0]`	`process`
`handle[1]`	`gui_script`
`handle[n]` for `n >= 2`	`IndexError`

Both patterns are valid and equivalent. New code should prefer the named attributes (.process, .gui_script, .workspace) for clarity; the indexing is for backward compat with the old tuple-yielding fixture.
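A handle with both compatibility paths can be sketched as a small dataclass. The class name LiveGuiHandleSketch is hypothetical and the real _LiveGuiHandle has more behaviour (ensure_alive, respawn tracking); this only shows how tuple unpacking and indexing can coexist with named attributes.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Iterator, Optional


@dataclass
class LiveGuiHandleSketch:
    process: Optional[Any]  # subprocess.Popen for the owner, None for clients
    gui_script: str
    workspace: Path
    respawn_count: int = 0

    def is_alive(self) -> bool:
        # poll() returns None while the subprocess is still running.
        return self.process is not None and self.process.poll() is None

    def __iter__(self) -> Iterator[Any]:
        # Supports the legacy `proc, script = live_gui` unpacking.
        return iter((self.process, self.gui_script))

    def __getitem__(self, index: int) -> Any:
        # Supports legacy `live_gui[0]` / `live_gui[1]`; index 2 and above raise IndexError.
        return (self.process, self.gui_script)[index]
```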

Test Categories

1. Unit Tests (no fixtures, fast)

Pure functions tested in isolation. No app, no GUI, no subprocess. Run in <100ms each.

Examples:

tests/test_command_palette.py — fuzzy matcher, command registry
tests/test_fuzzy_anchor.py — anchor slice algorithm
tests/test_paths.py — path resolution
tests/test_token_usage.py — token tracking
tests/test_cost_tracker.py — cost estimation

Pattern:

def test_my_unit():
    result = my_function(input)
    assert result == expected

2. Integration Tests (use `live_gui`, slow)

Drive the actual app via the Hook API. Run in 1-10 seconds each (real subprocess).

Examples:

tests/test_saved_presets_sim.py — preset switching via the GUI
tests/test_command_palette_sim.py — palette toggle, navigation
tests/test_mma_concurrent_tracks_sim.py — multi-track MMA
tests/test_workspace_profiles_sim.py — workspace profile save/load
tests/test_gui_dag_beads.py — Beads DAG visualization

Pattern:

def test_my_integration(live_gui):
    client = ApiHookClient()
    client.push_event("custom_callback", {
        "callback": "_my_method",
        "args": [arg1, arg2],
    })
    time.sleep(0.5)
    assert client.get_value("result") == expected

3. Mock App Tests (use `mock_app` or `app_instance`, fast)

Need an App instance but not the full render loop. Run in <500ms each.

Examples:

tests/test_text_viewer.py — text viewer state updates
tests/test_patch_modal.py — patch modal workflow
tests/test_gui2_events.py — event subscriptions

Pattern:

def test_my_thing(mock_app):
    mock_app.some_attr = "test_value"
    mock_app._do_something()
    assert mock_app.some_attr == "expected"

4. Headless Tests (no GUI, real services)

Test the FastAPI/headless service directly via the Hook API. No subprocess.

Examples:

tests/test_headless_service.py — service lifecycle
tests/test_headless_verification.py — full run with QA interceptor

5. Opt-in Tests (gated by env var)

Slow or network-dependent tests that don't run by default. Set the env var to enable.

Test File	Marker	Env Var	Purpose
`tests/test_clean_install.py`	`@pytest.mark.clean_install`	`RUN_CLEAN_INSTALL_TEST=1`	Clones the repo to tmp and verifies the hook API
`tests/test_docker_build.py`	`@pytest.mark.docker`	`RUN_DOCKER_TEST=1`	Builds and runs the Docker image

Running opt-in tests:

RUN_CLEAN_INSTALL_TEST=1 uv run pytest tests/test_clean_install.py -v
RUN_DOCKER_TEST=1 uv run pytest tests/test_docker_build.py -v

Markers

Defined in pyproject.toml:

[tool.pytest.ini_options]
markers = [
    "integration: marks tests as integration tests (requires live GUI)",
]

Adding a new marker: add it to the list. Pytest will warn if a marker is used but not registered.

Filtering by marker:

uv run pytest -m integration         # Only integration tests
uv run pytest -m "not integration"   # Skip integration tests
uv run pytest -m clean_install      # Opt-in clean install tests

The Hook API (For Integration Tests)

The live GUI exposes a Hook API on http://127.0.0.1:8999 when launched with --enable-test-hooks. The ApiHookClient (src/api_hook_client.py) is the Python wrapper.

Key Methods

client = ApiHookClient()  # connects to localhost:8999 by default

# Click a button
client.click("btn_reset")

# Set a widget value
client.set_value("ui_ai_input", "Hello world")

# Push a generic GUI task
client.push_event("custom_callback", {
    "callback": "_my_method",
    "args": [arg1, arg2],
})

# Get a value (gettable field)
value = client.get_value("show_command_palette")

# Wait for an event
event = client.wait_for_event("ai_response", timeout=10)

# Reset the session
client.reset_session()

`predefined_callbacks` Pattern

To make a test invoke an App method via the hook, register it in gui_2.py:

self.controller._predefined_callbacks['_my_method'] = self._my_method
self.controller._gettable_fields['show_thing'] = 'show_thing'

The test can then invoke _my_method via:

client.push_event("custom_callback", {
    "callback": "_my_method",
    "args": [],
})

This pattern is how the Command Palette's _toggle_command_palette is exposed for tests (since the keyboard shortcut can't be simulated via the hook).

Common Patterns

Testing a Pure Function

def test_my_function():
    from src.mymodule import my_function
    result = my_function("input", 42)
    assert result == "expected"

Testing with a Mock App

from unittest.mock import MagicMock

def test_with_mock():
    app = MagicMock()
    app.some_attr = "test"
    from src.mymodule import do_thing
    do_thing(app)
    app.some_method.assert_called_once()

Testing via live_gui

import time
import pytest
from src.api_hook_client import ApiHookClient

def test_via_gui(live_gui):
    client = ApiHookClient()
    client.push_event("custom_callback", {
        "callback": "_some_method",
        "args": ["value"],
    })
    time.sleep(0.5)
    assert client.get_value("result") == "expected"

Testing an Exception Path

import pytest

def test_raises():
    from src.mymodule import do_thing
    with pytest.raises(ValueError, match="expected message"):
        do_thing(bad_input)

Parametrized Tests

import pytest

# `value` is used instead of `input` to avoid shadowing the builtin
@pytest.mark.parametrize("value,expected", [
    ("a", 1),
    ("b", 2),
    ("c", 3),
])
def test_my_parametrized(value, expected):
    assert my_function(value) == expected

Test Configuration

`pyproject.toml`

[tool.pytest.ini_options]
asyncio_mode = "strict"
markers = [
    "integration: marks tests as integration tests (requires live GUI)",
]
asyncio_default_fixture_loop_scope = None
asyncio_default_test_loop_scope = "function"

asyncio_mode = "strict" means async tests need explicit @pytest.mark.asyncio. This is intentional — most Manual Slop tests are synchronous.

Coverage

Run with coverage:

uv run pytest tests/ --cov=src --cov-report=html

Open htmlcov/index.html in a browser. Target: >80% coverage for new code (per the project's quality gates).

Running Tests

All Tests

uv run pytest tests/ -v

Warning: This runs all 251 test files, including slow live_gui integration tests. Total runtime: 5-10 minutes.

Specific Test File

uv run pytest tests/test_command_palette.py -v

Specific Test

uv run pytest tests/test_command_palette.py::test_fuzzy_match_prefix_ranks_first -v

Batched Run (Categorized)

uv run python scripts/run_tests_batched.py

This runs the new categorized batcher: 6 fixture-class-isolated tiers (opt-in skipped by default, unit with xdist, mock_app, live_gui in one session, headless, performance). Each tier prints a summary line. Use --plan to see the batch plan without running; --audit to list unclassified files; --tiers 1,2 to limit which tiers run.

See conductor/tracks/test_batching_refactor_20260606/spec.md for the full design.

By Marker

uv run pytest -m integration -v      # Only integration tests
uv run pytest -m "not integration"   # Skip integration tests

With Stop on First Failure

uv run pytest tests/ -x -v

With Timeout

uv run pytest tests/ --timeout=60 -v

Adding a New Test

For a Pure Function

Add tests to an existing tests/test_<module>.py file (if it exists) or create a new one
Use def test_<thing>(): naming convention
No fixtures needed unless you're reading state
Verify it runs: uv run pytest tests/test_<file>.py::test_<name> -v

For an Integration Test

Create or extend a *_sim.py file
Add def test_<thing>(live_gui): with the live_gui fixture
Use ApiHookClient to drive the GUI
If you need to invoke an App method that's not yet exposed, register it as a _predefined_callbacks entry in gui_2.py
Verify: uv run pytest tests/test_<file>_sim.py::test_<name> -v

For an Opt-in Test (Clean Install / Docker)

Mark with @pytest.mark.<marker_name>
Gate the entire file with a skip if the env var isn't set
Add the marker to pyproject.toml's markers list
Document the env var in the test file's docstring
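The env-var gate in the steps above can be sketched as a small helper. The helper name opt_in_enabled is hypothetical; the pytest lines are shown as comments because they belong at the top of the test module.

```python
import os
from typing import Mapping, Optional


def opt_in_enabled(var_name: str, environ: Optional[Mapping[str, str]] = None) -> bool:
    """True only when the opt-in variable is set to exactly '1'."""
    env = os.environ if environ is None else environ
    return env.get(var_name) == "1"


# At module level in the opt-in test file (requires `import pytest`):
#
# pytestmark = [
#     pytest.mark.docker,
#     pytest.mark.skipif(
#         not opt_in_enabled("RUN_DOCKER_TEST"),
#         reason="set RUN_DOCKER_TEST=1 to run",
#     ),
# ]
```

Assigning pytestmark at module level applies both the marker and the skip condition to every test in the file.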

Debugging Failed Tests

Verbose Output

uv run pytest tests/test_X.py -v -s

-s disables stdout/stderr capture so you can see print() output.

Stop at First Failure

uv run pytest tests/test_X.py -x

Enter PDB on Failure

uv run pytest tests/test_X.py --pdb

Show Local Variables on Failure

uv run pytest tests/test_X.py -l

Re-run Last Failed

uv run pytest --lf

Common Failure Modes

Symptom	Likely Cause	Fix
`ImportError` for a module	Missing dependency or 1-space indent issue	Check pyproject.toml; run `uv sync`
`live_gui` times out	Previous test left a process running	`taskkill /F /IM python.exe` to clean up (Windows only; this kills every running python.exe, not just the test subprocess)
`get_value` returns `None`	Field not registered as gettable	Add to `self.controller._gettable_fields` in `gui_2.py`
`custom_callback` does nothing	Callback not registered	Add to `self.controller._predefined_callbacks`
`IM_ASSERT: Must call EndChild()`	Modal end_child/end pairing broken (usually from a buggy action)	Wrap actions in try/except; check for `imgui.end_child()` before `imgui.end()`
`pytest.fail` from `live_gui` startup	Hook server didn't start in 15s	Check `logs/gui_2_py_test.log` for crash

Audit Scripts

The project has 4 audit scripts that enforce static conventions. They run as pre-commit/CI gates and exit non-zero on regression.

Script	Enforces	Run command
`scripts/check_test_toml_paths.py`	No real-TOML references in `tests/` (must be sandboxed)	`python scripts/check_test_toml_paths.py`
`scripts/audit_main_thread_imports.py`	Main-thread-purity invariant: the main thread (entering `immapp.run()`) never imports a module heavier than `imgui_bundle` + lean `gui_2` skeleton. Heavy SDKs (`google.genai`, `anthropic`, `openai`, `fastapi`) are lazy-only.	`python scripts/audit_main_thread_imports.py`
`scripts/audit_weak_types.py`	Type-alias convention: 430 weak `dict[str, Any]` / `list[dict[...]]` / `Tuple[...]` types across 6 high-traffic files were replaced with named `TypeAlias`es in `src/type_aliases.py`. The audit enforces the convention going forward.	`python scripts/audit_weak_types.py`
`scripts/audit_no_models_config_io.py`	`AppController` is the single source of truth for config I/O; direct calls to `models.save_config` / `models.load_config` in `src/` are forbidden.	`python scripts/audit_no_models_config_io.py`

Per-script details:

check_test_toml_paths.py greps tests/ for direct ./<name>.toml references and exits 0 only if all tests are sandboxed. It's the enforcement mechanism for the "no real TOML in tests" rule. If violations are found, migrate the offender to use tmp_path + monkeypatch.

audit_main_thread_imports.py (added in startup_speedup_20260606) enforces the main-thread-purity invariant. It scans src/ for import statements that pull in heavy SDKs at module level. Each violation is a site that will re-introduce a multi-second startup cost. The startup_speedup track reduced violations from 67 to 62 (5 fixed; the remaining 62 are large refactors tracked in future work).

audit_weak_types.py (added in data_structure_strengthening_20260606) enforces the type-alias convention. The baseline file scripts/audit_weak_types.baseline.json records the post-refactor weak-type count (~60 after 86% reduction from 430 in 6 high-traffic files). New violations exit 1; the baseline lets you audit the delta.

audit_no_models_config_io.py enforces the "AppController is the sole config owner" rule from conductor/code_styleguides/config_state_owner.md.

Test Data Flow

A typical test goes through this lifecycle:

Test starts
  ├─> isolate_workspace (autouse)
  │     ├─> Creates tmp dir
  │     └─> Sets SLOP_* env vars
  │
  ├─> reset_paths (autouse)
  │     └─> paths.reset_resolved()
  │
  ├─> reset_ai_client (autouse)
  │     └─> Resets ai_client global state
  │
  ├─> (test body runs)
  │     ├─> If using live_gui: subprocess already running (session-scoped)
  │     ├─> Test makes API calls via ApiHookClient
  │     └─> Test asserts on returned values
  │
  └─> Teardown
        ├─> reset_paths runs again
        └─> (autouse) state cleanup

The live_gui session fixture runs once at the start of the test session and tears down once at the end. All tests in the session share the same sloppy.py process.

Known Gotchas (2026-06-05)

Authoring Robust `live_gui` Tests (Don't Assume Clean State)

live_gui is a session-scoped fixture. All tests in a session share the same sloppy.py subprocess. The subprocess is not restarted between tests; its internal state (Fonts, DisplaySize, internal caches, current theme, current workspace profile, current discussion, current MMA track) accumulates from the previous test.

This is a test-authoring contract, not a fixture bug. A test that "passes when run after test X" but "fails when run in isolation" is a fragile test. Robust live_gui tests must:

Not assume clean state. Before invoking an operation, explicitly verify the precondition via the Hook API (e.g. client.get_value("show_my_window"), client.get_mma_status(), client.get_session()). Do not assume a previous test set the state.
Use the wait-for-ready pattern, not fixed sleeps. time.sleep(1) is not enough for ImGui to stabilize in the first few render frames (use 3+ seconds, but better: use wait_for_event with a generous timeout, or poll client.get_status() until ImGui reports ready). Fixed sleeps are a code smell; if you reach for one, the right answer is almost always "poll a gettable field instead".
Reset state explicitly if the test depends on it. For tests that mutate state (e.g. "click button X"), reset the relevant state via Hook API in a try/finally so the next test starts from a known baseline. Alternatively, use a function-scoped helper that issues a reset_session callback before the test body.
Test both in the full suite AND in isolation before merging. If a test passes in the full suite but fails in isolation, the test is fragile — fix the test, don't add a "warmup" comment. Bisecting by pytest path::test -k "filter" or pytest --collect-only --quiet helps.
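The "poll a gettable field instead of sleeping" advice can be captured in a small helper. The name wait_until is hypothetical and is not part of ApiHookClient; the sketch only needs a zero-argument predicate.

```python
import time
from typing import Any, Callable


def wait_until(
    predicate: Callable[[], Any],
    timeout: float = 10.0,
    interval: float = 0.05,
) -> Any:
    """Poll until the predicate is truthy; return its last value either way."""
    deadline = time.monotonic() + timeout
    while True:
        value = predicate()
        if value:
            return value
        if time.monotonic() >= deadline:
            return value  # still falsy: let the caller's assert report it
        time.sleep(interval)
```

In a live_gui test this would read as assert wait_until(lambda: client.get_value("show_settings_modal")), "settings modal did not open", which returns as soon as the field becomes truthy instead of always paying a fixed delay.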

Use get_value/wait_for_event to assert ready, not just to assert success. Example:

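def test_open_settings_modal(live_gui):
    client = ApiHookClient()
    client.push_event("custom_callback", {"callback": "_toggle_settings", "args": []})
    # Poll until the modal actually appears, not just until the click is dispatched
    deadline = time.monotonic() + 5.0
    while not client.get_value("show_settings_modal") and time.monotonic() < deadline:
        time.sleep(0.05)
    assert client.get_value("show_settings_modal"), "settings modal did not open"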

The get_value poll doubles as a wait-for-ready AND a correctness assertion.

Anti-pattern (fragile):

def test_open_settings_modal(live_gui):
    client = ApiHookClient()
    client.push_event("custom_callback", {"callback": "_toggle_settings", "args": []})
    time.sleep(1)  # hope the modal opened
    assert some_cached_value["settings_open"] is True  # may be stale from a prior test

Pattern (robust):

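def test_open_settings_modal(live_gui):
    client = ApiHookClient()
    client.reset_session()  # function-scoped helper; Hook API reset callback
    client.push_event("custom_callback", {"callback": "_toggle_settings", "args": []})
    deadline = time.monotonic() + 5.0
    while not client.get_value("show_settings_modal") and time.monotonic() < deadline:
        time.sleep(0.05)  # poll the gettable field instead of a fixed sleep
    assert client.get_value("show_settings_modal"), "settings modal did not open"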

Early-Render C-Level Crashes (Defer-Not-Catch Pattern)

imgui.save_ini_settings_to_memory() (and similar raw imgui calls that read internal state) will crash the Python process at the C level (0xc0000005 access violation) if called before ImGui's internal state is fully initialized. This is not catchable from Python — try/except Exception cannot intercept native access violations.

Symptoms:

The sloppy.py subprocess disappears without a Python traceback.
The pytest output shows pytest.fail("Hook server did not start in 15s") (the subprocess died during startup).
Windows Event Viewer shows Faulting module: _imgui_bundle.cp311-win_amd64.pyd with exception code 0xc0000005.

Fix pattern: defer-not-catch. Track a one-shot "ready" flag in the instance state; return early on the first call, only invoking the C function on subsequent calls:

def _capture_workspace_profile(self, name: str) -> models.WorkspaceProfile:
    if not getattr(self, "_ini_capture_ready", False):
        # First call: ImGui may not be initialized yet, so skip the C call entirely
        self._ini_capture_ready = True
        return models.WorkspaceProfile(name=name, ini_content="", ...)  # str sentinel, not b""
    ini = imgui.save_ini_settings_to_memory()
    return models.WorkspaceProfile(name=name, ini_content=ini if isinstance(ini, str) else ini.decode("utf-8"), ...)

The first call (during initial startup) returns a safe empty profile and flips the flag; subsequent calls (when the user actually clicks "Save Profile") invoke the C function. The user's workflow is unaffected because the first call is non-blocking and the user cannot have clicked "Save Profile" before the GUI was fully rendered.
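The one-shot flag behaviour can be exercised in isolation, without imgui. This is a toy sketch under stated assumptions: ProfileCapture and fake_native_save are hypothetical stand-ins, and the real guard lives in _capture_workspace_profile.

```python
class ProfileCapture:
    """Toy model of the defer-not-catch guard (hypothetical, not src/gui_2.py)."""

    def __init__(self, native_save):
        # native_save stands in for a C-level call that is unsafe before first render.
        self._native_save = native_save
        self._ready = False

    def capture(self):
        if not self._ready:
            # First call: flip the one-shot flag and skip the native call entirely.
            self._ready = True
            return ""
        # Later calls: the native state is assumed to be initialized by now.
        return self._native_save()


calls = []


def fake_native_save():
    # Records that the "native" function was reached, then returns fake INI text.
    calls.append("native")
    return "[Window][Main]"


cap = ProfileCapture(fake_native_save)
first = cap.capture()   # deferred: native function not invoked
second = cap.capture()  # native function invoked exactly once
```

The guard never wraps the native call in try/except; it simply avoids making the call until a later invocation.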

See src/gui_2.py:601-606 for the canonical implementation. This pattern unblocks 4-5 live_gui tests that were crashing the GUI subprocess during the first render frames after _capture_workspace_profile was invoked by the test (typically via a save_workspace_profile Hook API callback).

Sentinel type contract. When implementing a defer-not-catch guard, the early-return sentinel must match the type the downstream consumer expects. For WorkspaceProfile.ini_content: str (in this codebase), the sentinel must be "" (str), not b"" (bytes): tomli_w rejects bytes (TypeError: Object of type 'bytes' is not TOML serializable), and imgui.load_ini_settings_from_memory(ini_data: str, ...) also expects str. A previous version of this fix used b"" and broke the save flow with a TypeError raised by tomli_w.dump. Unit tests still passed, but the live_gui save+load round-trip failed. The fix was a 1-character change (b"" → ""). The regression test in tests/test_workspace_profile_serialization.py encodes this contract.
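Dataclass annotations are not enforced at runtime, which is how a bytes sentinel can slip into a str field unnoticed. A regression check along these lines makes the contract explicit. It is a sketch only: WorkspaceProfileSketch and find_type_violations are hypothetical names, and only the ini_content: str field is taken from the text above.

```python
from dataclasses import dataclass, fields


@dataclass
class WorkspaceProfileSketch:
    # Hypothetical stand-in for the real model; annotations are not checked at runtime.
    name: str
    ini_content: str


def find_type_violations(obj):
    """Return names of fields whose runtime value does not match a plain-class annotation."""
    bad = []
    for f in fields(obj):
        # Only check annotations that are real classes (str, int, ...), not typing constructs.
        if isinstance(f.type, type) and not isinstance(getattr(obj, f.name), f.type):
            bad.append(f.name)
    return bad


good = WorkspaceProfileSketch(name="default", ini_content="")
bad = WorkspaceProfileSketch(name="default", ini_content=b"")  # constructs without error
```

Calling find_type_violations on each instance separates the two cases, even though both constructors succeed.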

Pattern: Narrow Test Paths vs. Kitchen-Sink Functions

Anti-pattern: calling a kitchen-sink function. A test that does gui_2.render_main_interface(app_instance) requires mocking 50+ imgui/imscope methods because render_main_interface dispatches to dozens of nested render functions. Adding a single mock for imscope.window (to return a tuple) just reveals the next un-mocked dependency (e.g. imgui.begin returning bool where a 2-tuple is expected). The test never reaches its assertion.

Better pattern: test the narrow function. Most render flows have a dedicated sub-function (e.g. render_prior_session_view, render_preset_manager_window, render_theme_panel). Refactor the test to call the narrow function directly with mocks scoped to what that function actually uses. Example outcome:

render_main_interface test: 50+ mocks, ~6s runtime, flakiness on every un-mocked imgui call.
render_prior_session_view test: 20 mocks, ~0.08s runtime, stable.

When to refactor vs. add mocks:

If the test intent is "verify push/pop balance in the prior-session render path", call the narrow function.
If the test intent is "verify the whole GUI render path is correct", accept the 50+ mock cost (and ensure all mocks are correct).
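A narrow test for the push/pop-balance intent might look like the sketch below. The function body and its signature are assumptions made for illustration (the real render_prior_session_view may take different arguments and make different imgui calls); the point is that a single MagicMock covers everything the narrow function touches.

```python
from unittest.mock import MagicMock


def render_prior_session_view_sketch(imgui, entries):
    """Hypothetical narrow render function: one style push, one pop, one text call per entry."""
    imgui.push_style_color("text", (0.7, 0.7, 0.7, 1.0))
    try:
        for entry in entries:
            imgui.text(entry)
    finally:
        # The pop runs even if a text call raises, keeping the style stack balanced.
        imgui.pop_style_color()


def test_prior_session_push_pop_balance():
    imgui = MagicMock()  # mock scoped to what this function actually uses
    render_prior_session_view_sketch(imgui, ["turn 1", "turn 2"])
    assert imgui.push_style_color.call_count == 1
    assert imgui.pop_style_color.call_count == 1
    assert imgui.text.call_count == 2
```

Because the mock is injected rather than patched module-wide, an un-mocked call elsewhere in the render tree cannot reach this test.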

See the prior_session_test_harden_20260605 plan in docs/superpowers/plans/ for the concrete refactor example.

Pattern: Indentation-Driven Method Visibility

The bug: A class method intended to sit at class level (2-space indent) may be parsed as nested inside the previous function if its indentation is off by even one space. The file still "passes" syntactically (imports OK), but the method is not on the class: hasattr(App, 'method_name') returns False. Any production code that calls app.method_name falls through to __getattr__, which delegates to the controller. The controller does not have the method either, so a cryptic AttributeError is raised at runtime.

How to detect:

Use AST to list all App methods: uv run python -c "import ast; tree = ast.parse(open('src/gui_2.py').read()); [print(item.name) for n in ast.walk(tree) if isinstance(n, ast.ClassDef) and n.name == 'App' for item in n.body if isinstance(item, ast.FunctionDef)]".
The skeleton via manual-slop_py_get_skeleton should show the method as a class member.
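The AST check above can be extended to report which defs ended up nested inside a method, which points directly at the drifted indentation. This is a sketch under stated assumptions: SAMPLE is made-up source, and class_methods and nested_functions are hypothetical helper names.

```python
import ast

# Made-up source: _capture_workspace_profile is indented to the level of the
# previous method's body, so it parses as a nested function, not a method.
SAMPLE = '''
class App:
  def _apply_snapshot(self):
    self.state = 1
    def _capture_workspace_profile(self):
      return None

  def render(self):
    pass
'''

_FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def class_methods(source, class_name):
    """Return names of functions defined directly in the class body."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return [n.name for n in node.body if isinstance(n, _FUNC_TYPES)]
    return []


def nested_functions(source, class_name):
    """Return (method, inner) pairs for defs nested inside a method of the class."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            for method in node.body:
                if not isinstance(method, _FUNC_TYPES):
                    continue
                for inner in ast.walk(method):
                    if inner is not method and isinstance(inner, _FUNC_TYPES):
                        pairs.append((method.name, inner.name))
    return pairs
```

Nested helper functions are sometimes intentional, so the pairs are candidates to review rather than confirmed bugs.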

How to fix: Re-indent the affected method to 2-space class level. Run the failing test to confirm. See the live_gui_test_hardening_v2_20260605 track in conductor/tracks.md for the concrete example (where _capture_workspace_profile was being parsed as nested inside _apply_snapshot due to a 1-space indentation drift after a cleanup commit).
