Comprehensive audit of 393 test files + the run_tests_batched runner.

Findings:
- 6 skip markers (4 same root cause: Gemini 503 in summarize.summarise_file)
- 60 files use time.sleep (38 live_gui, the banned anti-pattern)
- ~12-14 one-shot phase tests are cruft (verifying completed phases)
- 3 redundant test clusters (history: 5 files, theme: 6, markdown: 5)
- 27 live_gui tests are high-value test engine upgrade candidates
- ~44 tests are fine with the current Hook API
- ~10 new test capabilities enabled by the test engine (docking, focus, resize, keyboard, screenshots)
- The core batch is 245 files (62% of suite) and needs criticality-based splitting

Proposes a 3-dimension ordering taxonomy: (criticality, fixture, subsystem) with 6 criticality levels (C0-smoke through C5-stress). The live_gui tier mixes C0/C3/C4/C5; splitting by criticality enables fast-fail + targeted verification.

Recommends 4-track sequence: test_engine_integration → cruft_cleanup → ordering_taxonomy → test_engine_migration.
Test Suite Audit: Cruft, Test Engine Opportunities, Ordering Taxonomy
Date: 2026-06-27
Branch: tier2/post_module_taxonomy_de_cruft_20260627 at 60f4c67e
Scope: 393 test files in tests/ + the run_tests_batched.py runner + categorizer.py + batcher.py + test_categories.toml
1. Current Test Suite Inventory
By fixture class (the current tiering dimension)
| Fixture class | Tier | Count | Description |
|---|---|---|---|
| `unit` | 1 | 296 | Pure unit tests; no fixtures, or only lightweight mocks |
| `mock_app` | 2 | 35 | Uses the `mock_app` fixture (mocked App/AppController) |
| `live_gui` | 3 | 58 | Session-scoped real GUI subprocess via Hook API |
| `opt_in` | 0 | 2 | Clean install + docker build (env-var gated) |
| `performance` | P | 4 | Perf/stress tests |
| `headless` | H | 1 | Headless mode test |
| Total | | 396 | (393 unique `test_*.py` + 3 `_sim.py`/`_e2e.py` counted separately) |
By batch group (the sub-tier grouping)
| Batch group | Tier-1 (unit) | Tier-2 (mock_app) | Tier-3 (live_gui) | Total |
|---|---|---|---|---|
| `core` | 245 | 16 | 7 | 268 |
| `gui` | 21 | 9 | 28 | 58 |
| `mma` | 21 | 7 | 14 | 42 |
| `comms` | 7 | 2 | 0 | 9 |
| `headless` | 2 | 1 | 0 | 3 |
The core batch group at tier-1 is 245 files — 62% of the entire suite in a single xdist batch. This is the largest single batch and the primary bottleneck for targeted verification.
Current ordering mechanism
The runner uses a 2-level sort:
- Fixture class (the tier): 0 → 1 → 2 → 3 → H → P
- Batch group (within each tier): alphabetical (comms → core → gui → headless → mma)
Within each batch, pytest's default collection order applies (files in the order they are passed or discovered, then definition order within a file). There is no assertion-criticality ordering: a test that asserts "the GUI started" runs in the same batch as a test that asserts "the MMA DAG engine correctly handles transitive blocking propagation."
2. Cruft Inventory
2.1 Skip markers (6 sites)
| File | Line | Reason | Status |
|---|---|---|---|
| `test_aggregate_flags.py` | 5 | Gemini 503 flake in `summarize.summarise_file` | Documented; deferred to follow-up |
| `test_context_composition_phase6.py` | 5 | Gemini 503 flake (same pattern) | Documented; deferred |
| `test_context_composition_phase6.py` | 81 | Gemini 503 flake (same pattern) | Documented; deferred |
| `test_context_composition_phase6.py` | 153 | Gemini 503 flake (same pattern) | Documented; deferred |
| `test_mma_step_mode_sim.py` | 24 | `@pytest.mark.skipif` (env-gated) | Legitimate opt-in |
| `test_test_sandbox.py` | 417 | `@pytest.mark.skipif(os.name != "nt")` | Legitimate platform gate |
Assessment: 4 of 6 are the same root cause (Gemini 503 in summarize.summarise_file). The fix is mocking the Gemini API call — a single track eliminates all 4 skips. The 2 skipif markers are legitimate (env-gated opt-in + platform gate).
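The mocking fix can be sketched as follows. This is a minimal illustration, not the project's code: `GeminiClient`, `ServiceUnavailable`, and this `summarise_file` are hypothetical stand-ins defined inline, and the real `summarize.summarise_file` may reach the provider through a different boundary.

```python
from unittest import mock


class ServiceUnavailable(Exception):
    """Stand-in for the provider's HTTP 503 error (hypothetical)."""


class GeminiClient:
    """Hypothetical wrapper around the network call that flakes."""

    def generate(self, prompt: str) -> str:
        raise ServiceUnavailable("503: model overloaded")


def summarise_file(text: str) -> str:
    """Hypothetical stand-in for summarize.summarise_file."""
    return GeminiClient().generate(f"Summarise:\n{text}").strip()


def run_mocked_summary() -> str:
    # Patch the network boundary so the result is deterministic and offline.
    with mock.patch.object(GeminiClient, "generate", return_value=" a summary \n") as fake:
        result = summarise_file("def f(): pass")
    fake.assert_called_once()  # the code under test really went through the mock
    return result


assert run_mocked_summary() == "a summary"
```

Patching at the provider boundary keeps the summarisation logic itself under test, so the four skipped tests could assert on real post-processing rather than being disabled.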
2.2 time.sleep usage (60 files — 15% of the suite)
60 test files use time.sleep, the anti-pattern explicitly banned in workflow.md "Anti-Pattern: push_event + time.sleep(N) + assert." The distribution:
| Category | Count | Risk |
|---|---|---|
| `live_gui` tests with `time.sleep` | 38 | High: guaranteed race in batched runs |
| `mock_app` tests with `time.sleep` | 5 | Medium: mocked, but still fragile |
| `unit` tests with `time.sleep` | 12 | Low: usually in setup/teardown, not assertions |
| `performance` tests with `time.sleep` | 2 | Low: intentional for perf measurement |
| `opt_in` tests with `time.sleep` | 2 | Low: gated |
| `headless` tests with `time.sleep` | 1 | Low |
The 38 live_gui tests with time.sleep are the primary cruft. Each one is a latent race condition. The test engine's wait_for_test_results(timeout) + ctx.item_click (which blocks until the action completes) would replace these.
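Until the test engine is available, a bounded poll loop is the usual replacement for a fixed sleep. The sketch below is an assumption-laden illustration: `wait_until` is a hypothetical helper name, and the `state` dict stands in for whatever the Hook API would report.

```python
import threading
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)  # short, bounded back-off between checks
    return predicate()  # one last check at the deadline


# Simulated asynchronous GUI state change (stand-in for a Hook API query).
state = {"ready": False}
threading.Timer(0.1, lambda: state.update(ready=True)).start()

# Instead of: time.sleep(2); assert state["ready"]
assert wait_until(lambda: state["ready"], timeout=2.0)
```

The test returns as soon as the condition holds and fails with a clear timeout otherwise, rather than guessing how long the GUI needs.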
2.3 One-shot verification tests (likely cruft)
These tests were written to verify a specific phase shipped. The phase is long since complete; the test is still running every batch:
| File | What it verified | Still relevant? |
|---|---|---|
| `test_phase_3_final_verify.py` | Phase 3 final verification | No: phase shipped months ago |
| `test_rag_phase4_final_verify.py` | RAG Phase 4 final verification | No: phase shipped |
| `test_rag_phase4_stress.py` | RAG Phase 4 stress test | Maybe: stress tests have ongoing value |
| `test_arch_boundary_phase1.py` | Arch boundary phase 1 | No: superseded by phase 2/3 |
| `test_arch_boundary_phase2.py` | Arch boundary phase 2 | Maybe: regression guard |
| `test_arch_boundary_phase3.py` | Arch boundary phase 3 | Maybe: regression guard |
| `test_code_path_audit_phase78.py` | Code path audit phases 7-8 | No: audit closed |
| `test_code_path_audit_phase89.py` | Code path audit phases 8-9 | No: audit closed |
| `test_context_composition_phase3.py` | Context composition phase 3 | No: superseded |
| `test_context_composition_phase4.py` | Context composition phase 4 | No: superseded |
| `test_context_composition_phase6.py` | Context composition phase 6 | No: superseded (3 of 4 tests skipped anyway) |
| `test_gui_phase3.py` | GUI phase 3 | No: superseded |
| `test_gui_phase4.py` | GUI phase 4 | No: superseded |
| `test_metadata_promotion_phase1.py` | Metadata promotion phase 1 | Maybe: regression guard |
| `test_mma_agent_focus_phase1.py` | MMA agent focus phase 1 | No: superseded by phase 3 |
| `test_mma_agent_focus_phase3.py` | MMA agent focus phase 3 | Maybe: regression guard |
| `test_phase6_engine.py` | Phase 6 engine | Maybe: regression guard |
| `test_phase6_simulation.py` | Phase 6 simulation | Maybe: regression guard |
| `test_fixes_20260517.py` | Fixes from May 17 | No: one-shot fix verification |
| `test_project_context_20260627.py` | Project context (dated) | Maybe: recent enough to keep |
Assessment: ~12-14 of the 20 one-shot tests are cruft. The 12 marked "No" verify completed phases that won't regress unless someone reverts the feature. The 8 marked "Maybe" are regression guards for features that could still break, so only a couple of them are candidates for removal.
2.4 Potentially redundant test clusters
| Cluster | Files | Issue |
|---|---|---|
| History | `test_history.py`, `test_history_management.py`, `test_history_manager.py`, `test_history_message.py`, `test_orchestrator_pm_history.py` | 5 files for history; likely overlapping coverage |
| Theme | `test_theme.py`, `test_theme_2_no_top_level_nerv.py`, `test_theme_models.py`, `test_theme_nerv.py`, `test_theme_nerv_alert.py`, `test_theme_nerv_fx.py` | 6 files for theme; some are import-tests, some are functional |
| Markdown table | `test_markdown_table.py`, `test_markdown_table_columns.py`, `test_markdown_table_render.py`, `test_markdown_table_wrapped.py`, `test_markdown_helper_no_top_level_table.py` | 5 files for markdown tables; likely overlapping |
| Audit scripts | 10 `test_audit_*.py` files | Each tests a different audit script; not redundant, but heavy for "test the tests" |
2.5 Import-only tests (structural but low value)
10+ test_*_no_top_level_*.py files test that specific modules don't import heavy dependencies at module level. These were critical during the startup_speedup campaign but are now regression guards. They're cheap to run (unit tier) but add to the 245-file core batch.
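For readers unfamiliar with this kind of guard, a minimal sketch of an AST-based top-level import check is shown below. It illustrates the general technique only; the helper name `top_level_imports` and the sample source are hypothetical, and the project's actual tests may be implemented differently.

```python
import ast


def top_level_imports(source: str) -> set[str]:
    """Return root package names imported by module-level statements.

    Only direct children of the module body are inspected, so imports
    deferred into functions are ignored. Imports nested inside a
    top-level `if` or `try` are not caught by this simple version.
    """
    found: set[str] = set()
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


SAMPLE = '''
import os
from pathlib import Path

def render():
    import numpy  # deferred: only paid for when render() is called
'''

imports = top_level_imports(SAMPLE)
assert "numpy" not in imports  # the heavy dependency stays off the startup path
assert imports == {"os", "pathlib"}
```

Because the check parses source text without importing it, it stays cheap enough for the unit tier.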
3. Test Engine Upgrade Opportunities
3.1 Tests that would benefit from the ImGui Test Engine (high-value upgrades)
These are tests where the current Hook API cannot express the interaction being tested, or where time.sleep makes them fragile. The test engine's ctx.dock_into, ctx.window_focus, ctx.window_resize, ctx.item_click, ctx.capture_screenshot_window would replace the current Puppeteer-style approach.
| Test file | What it tests | Test engine primitive that upgrades it |
|---|---|---|
| `test_workspace_profiles_sim.py` | Save/restore docking layout via `show_windows` dict + `save_workspace_profile` callback | `ctx.dock_into` + `ctx.window_focus` + `ctx.capture_screenshot_window` for visual regression |
| `test_auto_switch_sim.py` | Auto-switch workspace profile based on MMA tier | `ctx.dock_into` + `ctx.window_focus` + state assertion via `ctx.item_info` |
| `test_task_dag_popout_sim.py` | Pop-out panel to standalone viewport | `ctx.window_focus` + `ctx.window_info` (is it docked or floating?) |
| `test_usage_analytics_popout_sim.py` | Pop-out usage analytics panel | Same as above |
| `test_preset_windows_layout.py` | Preset window layout restoration | `ctx.dock_into` + `ctx.capture_screenshot_window` |
| `test_gui_text_viewer.py` | Text viewer rendering + docking | `ctx.window_focus` + `ctx.scroll_to_item` + `ctx.capture_screenshot_window` |
| `test_gui_context_presets.py` | Context preset panel interactions | `ctx.item_click` + `ctx.item_check` |
| `test_tool_management_layout.py` | Tool management panel layout | `ctx.item_click` + `ctx.window_info` |
| `test_selectable_ui.py` | Selectable text rendering | `ctx.item_click` + `ctx.item_info` (is the item selectable?) |
| `test_command_palette_sim.py` | Command palette open + search + select | `ctx.key_chars` + `ctx.item_click` + `ctx.key_press` (Enter) |
| `test_undo_redo_sim.py` | Undo/redo workspace state | `ctx.key_press` (Ctrl+Z, Ctrl+Y) + state assertion |
| `test_mma_step_mode_sim.py` | MMA step mode approval flow | `ctx.item_click` (approve button) + `ctx.item_info` (is it enabled?) |
| `test_mma_concurrent_tracks_sim.py` | Concurrent track execution UI | `ctx.item_click` + `ctx.window_info` (stream visibility) |
| `test_visual_mma.py` | MMA visual dashboard | `ctx.capture_screenshot_window` + `ctx.item_click` |
| `test_visual_orchestration.py` | Visual orchestration panel | `ctx.capture_screenshot_window` |
| `test_visual_sim_gui_ux.py` | GUI UX event routing + performance | `ctx.item_click` + `ctx.capture_screenshot_window` |
| `test_visual_sim_mma_v2.py` | MMA visual simulation v2 | `ctx.capture_screenshot_window` |
| `test_z_negative_flows.py` | Negative flow handling | `ctx.item_click` + `ctx.item_info` (error state) |
| `test_live_markdown_render.py` | Live markdown rendering | `ctx.capture_screenshot_window` + `ctx.scroll_to_item` |
| `test_live_workflow.py` | Live workflow end-to-end | `ctx.item_click` + `ctx.key_chars` + `ctx.capture_screenshot_window` |
| `test_reset_session_clears_mma_and_rag.py` | Session reset clears state | `ctx.item_click` (reset button) + state assertion |
| `test_saved_presets_sim.py` | Preset save/load | `ctx.item_click` + `ctx.item_info` |
| `test_system_prompt_sim.py` | System prompt switching | `ctx.item_click` + `ctx.item_info` |
| `test_gui_stress_performance.py` | Stress performance | `ctx.capture_screenshot_window` (visual regression under load) |
| `test_gui_performance_requirements.py` | Performance requirements | `ctx.capture_screenshot_window` + FPS check |
| `test_gui_startup_smoke.py` | Startup smoke test | `ctx.window_info` (is the main window visible?) |
| `test_hooks.py` | Hook API integration | `ctx.item_click` + state assertion (replace `time.sleep` polling) |
Total: 27 live_gui tests are high-value upgrade candidates. That's 47% of the 58 live_gui tests.
3.2 Tests that are fine with the current Hook API (low-value upgrades)
These tests use the Hook API for state mutation + assertion, which the test engine doesn't improve on:
| Test pattern | Count | Why the Hook API is sufficient |
|---|---|---|
| Provider tests (gemini, deepseek, minimax, grok, qwen, llama) | ~8 | These test API responses, not UI |
| API hook endpoint tests | ~6 | These test the HTTP endpoints themselves |
| MMA model/logic tests | ~10 | These test data structures, not rendering |
| RAG engine tests | ~5 | These test the search engine, not UI |
| Import/audit tests | ~15 | These are AST/import checks, no UI |
Total: ~44 tests are fine as-is. The test engine adds no value for pure-logic or pure-API tests.
3.3 Tests that become possible ONLY with the test engine (new capabilities)
These interactions are impossible with the current Hook API and would be new tests enabled by the test engine:
| Capability | Test engine primitive | What it enables |
|---|---|---|
| Drag-and-drop docking | `ctx.dock_into(src, dst, dir)` | Test that dragging the Context Hub into the Session Hub docks correctly |
| Window focus order | `ctx.window_focus(ref)` | Test that clicking a panel brings it to front |
| Window resize | `ctx.window_resize(ref, sz)` | Test that resizing a panel doesn't break rendering |
| Keyboard shortcuts | `ctx.key_press(KeyChord)` | Test Ctrl+Z/Ctrl+Y, Ctrl+Shift+P (command palette), etc. |
| Tab close | `ctx.tab_close(ref)` | Test that closing a discussion tab removes it |
| Table column resize | `ctx.table_resize_column(ref, col, width)` | Test that resizing a table column persists |
| Screenshot diff | `ctx.capture_screenshot_window(ref)` | Visual regression: compare rendering across commits |
| Item hover | `ctx.item_hold(ref, time)` | Test tooltip behavior |
| Multi-step input | `ctx.key_chars("text")` + `ctx.key_press(Enter)` | Test form input flows |
| Tree open/close | `ctx.item_open_all(ref)` | Test tree expansion behavior |
4. Proposed Ordering Taxonomy (Assertion-Criticality-Based)
The problem with the current ordering
The current 2-level sort (fixture class → batch group) has no notion of what a test actually asserts. A test that verifies "the app starts" runs in the same batch as a test that verifies "the MMA DAG engine handles transitive blocking." If the former fails, the latter's result is meaningless (the app is broken, so everything downstream is suspect). But the batch runner reports them as independent pass/fail.
Proposed taxonomy: 3 dimensions
Dimension 1: Assertion criticality (the new ordering key)
| Level | Name | Description | Examples |
|---|---|---|---|
| C0 | Smoke | "Does the app start and respond?" | Hook server health, GUI startup, basic import |
| C1 | Structural | "Do the core subsystems exist and have the right shape?" | Dataclass field checks, type alias existence, model import |
| C2 | Behavioral | "Do the core subsystems behave correctly in isolation?" | DAG engine cycle detection, history push/undo, error handling Result[T] |
| C3 | Integration | "Do subsystems compose correctly?" | Hook API → controller → GUI task dispatch, AI client → provider dispatch |
| C4 | UI/Visual | "Does the GUI render correctly and respond to user input?" | Docking, focus, panel visibility, command palette, undo/redo |
| C5 | Stress/Perf | "Does it hold up under load?" | Concurrent tracks, stress performance, batch resilience |
Dimension 2: Fixture class (the existing tiering — retained)
| Fixture | Current tier | Maps to |
|---|---|---|
| `unit` | 1 | C1 + C2 (mostly) |
| `mock_app` | 2 | C2 + C3 (mostly) |
| `live_gui` | 3 | C3 + C4 + C5 (mixed!) |
| `headless` | H | C3 |
| `performance` | P | C5 |
| `opt_in` | 0 | C0 (clean install) |
Key insight: the live_gui tier (58 tests) is currently a single undifferentiated tier that mixes C0 (smoke), C3 (integration), C4 (UI/visual), and C5 (stress) tests. Splitting it by criticality would allow:
- Running C0 smoke first (fast fail if the app is broken)
- Running C4 UI tests with the test engine (slower but high-fidelity)
- Running C5 stress tests last (only if C0-C4 pass)
Dimension 3: Subsystem (the existing batch group — retained)
The core/gui/mma/comms/headless grouping stays. It's useful for targeted runs ("just run the gui tests").
Proposed ordering: (criticality, fixture, subsystem)
C0-smoke (any fixture) ← run first; fast fail
C1-structural (unit) ← run second; cheap
C2-behavioral (unit) ← run third; still cheap
C2-behavioral (mock_app) ← run fourth; mocked integration
C3-integration (mock_app) ← run fifth
C3-integration (live_gui) ← run sixth; real subprocess
C4-ui (live_gui + test_engine) ← run seventh; high-fidelity
C5-stress (live_gui) ← run last; only if C0-C4 pass
C5-perf (performance) ← run last; opt-in
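The gating implied by "fast fail" and "only if C0-C4 pass" can be sketched as below. This is an illustrative assumption about how a runner might apply the ordering, not the behavior of `run_tests_batched.py`; `Batch`, `run_plan`, and the result strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Batch:
    name: str
    criticality: int  # 0 = smoke ... 5 = stress/perf


def run_plan(batches: list[Batch], run: Callable[[Batch], bool]) -> dict[str, str]:
    """Run batches in criticality order with two gates:
    a failed C0 batch skips everything after it, and C5 batches
    run only if every earlier batch passed."""
    results: dict[str, str] = {}
    smoke_failed = False
    any_failed = False
    for batch in sorted(batches, key=lambda b: b.criticality):
        if smoke_failed:
            results[batch.name] = "skipped (smoke failed)"
        elif batch.criticality == 5 and any_failed:
            results[batch.name] = "skipped (earlier failure)"
        else:
            ok = run(batch)
            results[batch.name] = "passed" if ok else "failed"
            any_failed = any_failed or not ok
            smoke_failed = smoke_failed or (batch.criticality == 0 and not ok)
    return results


plan = [Batch("C5-stress-gui", 5), Batch("C0-smoke-any", 0), Batch("C2-behavioral-core", 2)]
# Simulate a broken app: the smoke batch fails, so nothing downstream is run.
outcome = run_plan(plan, run=lambda b: b.criticality != 0)
assert outcome["C2-behavioral-core"] == "skipped (smoke failed)"
```

Reporting downstream batches as skipped rather than failed keeps a broken startup from showing up as dozens of unrelated failures.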
How this maps to the batched runner
The categorizer.py would gain a criticality: Criticality field on CategoryRecord. The batcher.py would sort by (criticality, fixture_class, batch_group) instead of (fixture_class, batch_group). The test_categories.toml registry would allow manual override of criticality for specific tests.
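A minimal sketch of the proposed field and sort key follows. The field names on `CategoryRecord`, the `FIXTURE_ORDER` mapping, and the criticality assigned to each sample file are assumptions for illustration; only the tier order (0 → 1 → 2 → 3 → H → P) and the tuple `(criticality, fixture_class, batch_group)` come from this document.

```python
from dataclasses import dataclass
from enum import IntEnum


class Criticality(IntEnum):
    C0_SMOKE = 0
    C1_STRUCTURAL = 1
    C2_BEHAVIORAL = 2
    C3_INTEGRATION = 3
    C4_UI = 4
    C5_STRESS = 5


# Mirrors the existing tier order 0 -> 1 -> 2 -> 3 -> H -> P.
FIXTURE_ORDER = {"opt_in": 0, "unit": 1, "mock_app": 2, "live_gui": 3, "headless": 4, "performance": 5}


@dataclass(frozen=True)
class CategoryRecord:
    path: str
    fixture_class: str
    batch_group: str
    criticality: Criticality = Criticality.C2_BEHAVIORAL  # proposed default


def sort_key(rec: CategoryRecord) -> tuple[int, int, str, str]:
    # Path is a final tie-breaker so the plan is deterministic.
    return (rec.criticality, FIXTURE_ORDER[rec.fixture_class], rec.batch_group, rec.path)


records = [
    CategoryRecord("test_undo_redo_sim.py", "live_gui", "gui", Criticality.C4_UI),
    CategoryRecord("test_history.py", "unit", "core"),
    CategoryRecord("test_gui_startup_smoke.py", "live_gui", "gui", Criticality.C0_SMOKE),
]
ordered = [r.path for r in sorted(records, key=sort_key)]
assert ordered == ["test_gui_startup_smoke.py", "test_history.py", "test_undo_redo_sim.py"]
```

Using an `IntEnum` lets the criticality level participate directly in tuple comparison, so the change to the batcher reduces to prepending one element to the existing key.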
The run_tests_batched.py --plan output would show the criticality level:
[RUN] C0-smoke-any: 3 files, est 2s ← hook health, GUI startup
[RUN] C1-structural-core: 45 files, est 22s ← dataclass + type checks
[RUN] C2-behavioral-core: 120 files, est 60s ← logic tests
[RUN] C2-behavioral-gui: 15 files, est 8s
[RUN] C3-integration-mma: 12 files, est 15s
[RUN] C4-ui-gui: 27 files, est 40s ← test engine tests (post-integration)
[RUN] C5-stress-gui: 5 files, est 20s
Migration path
- Phase 1: Add the `criticality` field to `CategoryRecord` + auto-inference rules. Default all existing tests to C2 (behavioral), the current median. Manual overrides in `test_categories.toml` for C0, C1, C3, C4, C5 tests.
- Phase 2: Update `batcher.py` to sort by `(criticality, fixture_class, batch_group)`.
- Phase 3: Curate the criticality assignments: audit each test, assign the correct level. This is the bulk of the work; can be done incrementally.
- Phase 4 (post-test-engine): Re-classify the 27 test-engine-upgrade candidates as C4-ui. The test engine enables higher-fidelity assertions for these.
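Phase 1's "auto-inference rules + manual overrides" could look roughly like the sketch below. The filename patterns are invented for illustration and are not the project's rules; the `overrides` dict stands in for entries that would be read from `test_categories.toml`, whose schema is not specified here.

```python
import re

# Hypothetical inference rules: first matching pattern wins.
RULES: list[tuple[re.Pattern[str], str]] = [
    (re.compile(r"smoke"), "C0"),
    (re.compile(r"stress|performance"), "C5"),
    (re.compile(r"_sim\.py$|^test_visual_"), "C4"),
]
DEFAULT_LEVEL = "C2"  # the proposed default for unclassified tests


def infer_criticality(filename: str, overrides: dict[str, str]) -> str:
    """Manual override first, then pattern rules, then the C2 default."""
    if filename in overrides:
        return overrides[filename]
    for pattern, level in RULES:
        if pattern.search(filename):
            return level
    return DEFAULT_LEVEL


# Stand-in for manual entries loaded from test_categories.toml.
overrides = {"test_hooks.py": "C3"}

assert infer_criticality("test_gui_startup_smoke.py", overrides) == "C0"
assert infer_criticality("test_history.py", overrides) == "C2"
assert infer_criticality("test_hooks.py", overrides) == "C3"
```

Checking overrides before rules means a curated assignment from Phase 3 always wins over a filename heuristic.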
5. Summary
| Category | Count | Action |
|---|---|---|
| Total test files | 393 | n/a |
| Skip markers | 6 (4 same root cause) | Mock Gemini API in `summarize.summarise_file` → eliminates 4 skips |
| `time.sleep` users | 60 (38 live_gui) | Replace with poll loops (Hook API) or `ctx.wait_for_test_results` (test engine) |
| One-shot phase tests (cruft) | ~12-14 | Delete or consolidate into regression suites |
| Redundant clusters | 3 clusters (history: 5, theme: 6, markdown: 5) | Audit for overlap; consolidate |
| Test engine upgrade candidates | 27 live_gui tests | Migrate after `test_engine_integration_20260627` ships |
| Tests fine as-is | ~44 (Section 3.2) + all unit/mock_app | No change needed |
| New tests enabled by test engine | ~10 capabilities | Docking, focus, resize, keyboard, screenshots |
| `core` batch (245 files, 62%) | 245 | Split by criticality for targeted verification |
Recommended track sequence
1. `test_engine_integration_20260627` (initialized): build the bridge
2. `test_suite_cruft_cleanup_<date>` (new): delete one-shot cruft, fix Gemini 503 skips, consolidate redundant clusters, replace `time.sleep` with poll loops
3. `test_ordering_taxonomy_<date>` (new): add the criticality dimension to the batched runner
4. `test_engine_migration_<date>` (Campaign A Track 2): migrate the 27 high-value live_gui tests to the test engine