manual_slop/docs/reports/test_suite_audit_20260627.md
ed 2b392b1f76 docs(audit): test suite analysis — cruft, test engine opportunities, ordering taxonomy
Comprehensive audit of 393 test files + the run_tests_batched runner.
Findings:
- 6 skip markers (4 same root cause: Gemini 503 in summarize.summarise_file)
- 60 files use time.sleep (38 live_gui — the banned anti-pattern)
- ~12-14 one-shot phase tests are cruft (verifying completed phases)
- 3 redundant test clusters (history: 5 files, theme: 6, markdown: 5)
- 27 live_gui tests are high-value test engine upgrade candidates
- ~44 live_gui tests are fine with the current Hook API
- ~10 new test capabilities enabled by the test engine (docking, focus, resize, keyboard, screenshots)
- The core batch is 245 files (62% of suite) — needs criticality-based splitting

Proposes a 3-dimension ordering taxonomy: (criticality, fixture, subsystem)
with 6 criticality levels (C0-smoke through C5-stress). The live_gui tier
mixes C0/C3/C4/C5 — splitting by criticality enables fast-fail + targeted
verification.

Recommends 4-track sequence: test_engine_integration → cruft_cleanup →
ordering_taxonomy → test_engine_migration.
2026-06-27 16:00:35 -04:00

Test Suite Audit: Cruft, Test Engine Opportunities, Ordering Taxonomy

Date: 2026-06-27
Branch: tier2/post_module_taxonomy_de_cruft_20260627 at 60f4c67e
Scope: 393 test files in tests/ + the run_tests_batched.py runner + categorizer.py + batcher.py + test_categories.toml


1. Current Test Suite Inventory

By fixture class (the current tiering dimension)

| Fixture class | Tier | Count | Description |
|---|---|---|---|
| unit | 1 | 296 | Pure unit tests; no fixtures or lightweight mocks |
| mock_app | 2 | 35 | Uses the mock_app fixture (mocked App/AppController) |
| live_gui | 3 | 58 | Session-scoped real GUI subprocess via Hook API |
| opt_in | 0 | 2 | Clean install + docker build (env-var gated) |
| performance | P | 4 | Perf/stress tests |
| headless | H | 1 | Headless mode test |
| Total | | 396 | 393 unique test_*.py + 3 _sim.py/_e2e.py counted separately |

By batch group (the sub-tier grouping)

| Batch group | Tier-1 (unit) | Tier-2 (mock_app) | Tier-3 (live_gui) | Total |
|---|---|---|---|---|
| core | 245 | 16 | 7 | 268 |
| gui | 21 | 9 | 28 | 58 |
| mma | 21 | 7 | 14 | 42 |
| comms | 7 | 2 | 0 | 9 |
| headless | 2 | 1 | 0 | 3 |

The core batch group at tier-1 holds 245 files, 62% of the 393-file suite, in a single xdist batch. It is the largest batch and the primary bottleneck for targeted verification.

Current ordering mechanism

The runner uses a 2-level sort:

  1. Fixture class (the tier): 0 → 1 → 2 → 3 → H → P
  2. Batch group (within each tier): alphabetical (comms → core → gui → headless → mma)

Within each batch, pytest's default collection order applies: files in sorted path order, then tests in definition order within each file. There is no assertion-criticality ordering: a test that asserts "the GUI started" runs in the same batch as a test that asserts "the MMA DAG engine correctly handles transitive blocking propagation."
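
The 2-level sort described above can be sketched as a plain sort key. This is a minimal illustration, not the runner's actual code: the `Batch` type, `TIER_ORDER`, and `current_sort_key` are hypothetical names introduced here, and only the tier sequence and the alphabetical batch-group rule come from this report.

```python
from typing import NamedTuple

# Tier sequence as stated above: opt_in (0), unit (1), mock_app (2),
# live_gui (3), headless (H), performance (P).
TIER_ORDER = ["0", "1", "2", "3", "H", "P"]


class Batch(NamedTuple):
    """Hypothetical stand-in for one (tier, batch group) batch."""
    tier: str
    batch_group: str


def current_sort_key(batch: Batch) -> tuple[int, str]:
    """Order by tier position first, then by batch group alphabetically."""
    return (TIER_ORDER.index(batch.tier), batch.batch_group)


batches = [
    Batch("3", "gui"),
    Batch("1", "mma"),
    Batch("H", "headless"),
    Batch("1", "core"),
    Batch("2", "core"),
    Batch("1", "comms"),
]
ordered = sorted(batches, key=current_sort_key)
for b in ordered:
    print(f"tier {b.tier}: {b.batch_group}")
```

Nothing in this key looks at what a test asserts, which is the gap the proposed taxonomy addresses.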


2. Cruft Inventory

2.1 Skip markers (6 sites)

| File | Line | Reason | Status |
|---|---|---|---|
| test_aggregate_flags.py | 5 | Gemini 503 flake in summarize.summarise_file | Documented; deferred to follow-up |
| test_context_composition_phase6.py | 5 | Gemini 503 flake (same pattern) | Documented; deferred |
| test_context_composition_phase6.py | 81 | Gemini 503 flake (same pattern) | Documented; deferred |
| test_context_composition_phase6.py | 153 | Gemini 503 flake (same pattern) | Documented; deferred |
| test_mma_step_mode_sim.py | 24 | @pytest.mark.skipif (env-gated) | Legitimate opt-in |
| test_test_sandbox.py | 417 | @pytest.mark.skipif(os.name != "nt") | Legitimate platform gate |

Assessment: 4 of 6 are the same root cause (Gemini 503 in summarize.summarise_file). The fix is mocking the Gemini API call — a single track eliminates all 4 skips. The 2 skipif markers are legitimate (env-gated opt-in + platform gate).
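
One way the mocking fix could look, as a hedged sketch: the real internals of summarize.summarise_file are not shown in this report, so `_call_gemini` and the `SimpleNamespace` stand-in for the `summarize` module are assumptions made only to keep the example self-contained. The point is the pattern: patch the network call so the test never reaches the provider.

```python
from types import SimpleNamespace
from unittest import mock


def _call_gemini(prompt: str) -> str:
    """Hypothetical network call; here it always fails like a 503 flake."""
    raise RuntimeError("503 Service Unavailable")


def summarise_file(path: str) -> str:
    """Hypothetical stand-in for summarize.summarise_file."""
    return summarize._call_gemini(f"Summarise {path}")


# Stand-in for the real `summarize` module (assumption for self-containment).
summarize = SimpleNamespace(_call_gemini=_call_gemini, summarise_file=summarise_file)


def test_summarise_file_without_network() -> None:
    # Replace the provider call with a canned response for the test's duration.
    with mock.patch.object(summarize, "_call_gemini", return_value="stub summary") as fake:
        assert summarize.summarise_file("src/app.py") == "stub summary"
    fake.assert_called_once_with("Summarise src/app.py")


test_summarise_file_without_network()
```

In the real suite the patch target would be whatever function or client method summarise_file actually calls; once that is patched, the four skip markers have no remaining reason to exist.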

2.2 time.sleep usage (60 files — 15% of the suite)

60 test files use time.sleep, the anti-pattern explicitly banned in workflow.md under "Anti-Pattern: push_event + time.sleep(N) + assert". The distribution by fixture class:

| Category | Count | Risk |
|---|---|---|
| live_gui tests with time.sleep | 38 | High: guaranteed race in batched runs |
| mock_app tests with time.sleep | 5 | Medium: mocked, but still fragile |
| unit tests with time.sleep | 12 | Low: usually in setup/teardown, not assertions |
| performance tests with time.sleep | 2 | Low: intentional for perf measurement |
| opt_in tests with time.sleep | 2 | Low: gated |
| headless tests with time.sleep | 1 | Low |

The 38 live_gui tests with time.sleep are the primary cruft. Each one is a latent race condition. The test engine's wait_for_test_results(timeout) + ctx.item_click (which blocks until the action completes) would replace these.
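
Until the test engine is available, a bounded poll loop is the usual replacement for a fixed sleep followed by an assert. The helper below is a generic sketch; `wait_until` is a hypothetical name, not an existing Hook API function. It still sleeps between polls, but the test proceeds as soon as the condition holds and fails deterministically at the deadline.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage: the condition becomes true on the third poll.
polls = {"count": 0}


def state_ready() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


assert wait_until(state_ready, timeout=1.0, interval=0.01)
```

In a live_gui test the predicate would read application state through the Hook API, so the assertion is tied to the observed state rather than to elapsed time.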

2.3 One-shot verification tests (likely cruft)

These tests were written to verify a specific phase shipped. The phase is long since complete; the test is still running every batch:

| File | What it verified | Still relevant? |
|---|---|---|
| test_phase_3_final_verify.py | Phase 3 final verification | No: phase shipped months ago |
| test_rag_phase4_final_verify.py | RAG Phase 4 final verification | No: phase shipped |
| test_rag_phase4_stress.py | RAG Phase 4 stress test | Maybe: stress tests have ongoing value |
| test_arch_boundary_phase1.py | Arch boundary phase 1 | No: superseded by phase 2/3 |
| test_arch_boundary_phase2.py | Arch boundary phase 2 | Maybe: regression guard |
| test_arch_boundary_phase3.py | Arch boundary phase 3 | Maybe: regression guard |
| test_code_path_audit_phase78.py | Code path audit phases 7-8 | No: audit closed |
| test_code_path_audit_phase89.py | Code path audit phases 8-9 | No: audit closed |
| test_context_composition_phase3.py | Context composition phase 3 | No: superseded |
| test_context_composition_phase4.py | Context composition phase 4 | No: superseded |
| test_context_composition_phase6.py | Context composition phase 6 | No: superseded (3 of 4 tests skipped anyway) |
| test_gui_phase3.py | GUI phase 3 | No: superseded |
| test_gui_phase4.py | GUI phase 4 | No: superseded |
| test_metadata_promotion_phase1.py | Metadata promotion phase 1 | Maybe: regression guard |
| test_mma_agent_focus_phase1.py | MMA agent focus phase 1 | No: superseded by phase 3 |
| test_mma_agent_focus_phase3.py | MMA agent focus phase 3 | Maybe: regression guard |
| test_phase6_engine.py | Phase 6 engine | Maybe: regression guard |
| test_phase6_simulation.py | Phase 6 simulation | Maybe: regression guard |
| test_fixes_20260517.py | Fixes from May 17 | No: one-shot fix verification |
| test_project_context_20260627.py | Project context (dated) | Maybe: recent enough to keep |

Assessment: ~12-14 of the 20 one-shot tests are cruft (12 are marked "No" above). The "Maybe" entries are regression guards for features that could still break; the "No" entries verify completed phases that won't regress unless someone reverts the feature.
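
A filename heuristic can shortlist candidates like these for future audits. The sketch below is an assumption about how such a scan might work, not the method used for this report; the `ONE_SHOT` pattern and `looks_one_shot` are hypothetical. It only flags names, so the "No" versus "Maybe" decision still needs a human reading the test.

```python
import re

# Hypothetical naming heuristic: phase-numbered, "final_verify", or
# date-stamped (YYYYMMDD) test files.
ONE_SHOT = re.compile(r"_phase_?\d+|_final_verify|_\d{8}\.py$")


def looks_one_shot(filename: str) -> bool:
    """Return True if the filename matches a one-shot naming pattern."""
    return ONE_SHOT.search(filename) is not None


candidates = [
    name
    for name in [
        "test_phase_3_final_verify.py",
        "test_gui_phase4.py",
        "test_fixes_20260517.py",
        "test_history.py",
        "test_theme_models.py",
    ]
    if looks_one_shot(name)
]
print(candidates)
```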

2.4 Potentially redundant test clusters

| Cluster | Files | Issue |
|---|---|---|
| History | test_history.py, test_history_management.py, test_history_manager.py, test_history_message.py, test_orchestrator_pm_history.py | 5 files for history; likely overlapping coverage |
| Theme | test_theme.py, test_theme_2_no_top_level_nerv.py, test_theme_models.py, test_theme_nerv.py, test_theme_nerv_alert.py, test_theme_nerv_fx.py | 6 files for theme; some are import-tests, some are functional |
| Markdown table | test_markdown_table.py, test_markdown_table_columns.py, test_markdown_table_render.py, test_markdown_table_wrapped.py, test_markdown_helper_no_top_level_table.py | 5 files for markdown tables; likely overlapping |
| Audit scripts | 10 test_audit_*.py files | Each tests a different audit script; not redundant, but heavy for "test the tests" |

2.5 Import-only tests (structural but low value)

10+ test_*_no_top_level_*.py files test that specific modules don't import heavy dependencies at module level. These were critical during the startup_speedup campaign but are now regression guards. They're cheap to run (unit tier) but add to the 245-file core batch.
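
For readers unfamiliar with this kind of guard, the check typically parses the module source and inspects only module-level statements. The following is a minimal sketch of that idea using the standard `ast` module; `top_level_imports` and the sample source are hypothetical, and the real tests may be implemented differently. It does not descend into module-level `if` or `try` blocks.

```python
import ast


def top_level_imports(source: str) -> set[str]:
    """Return root package names imported by module-level statements."""
    names: set[str] = set()
    for node in ast.parse(source).body:  # direct children of the module only
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


# Sample module: numpy is imported lazily inside a function, so it does not
# count as a top-level import. The source is parsed, never executed.
SAMPLE = '''
import os
from json import loads

def render():
    import numpy as np
    return np.zeros(1)
'''

assert "numpy" not in top_level_imports(SAMPLE)
```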


3. Test Engine Upgrade Opportunities

3.1 Tests that would benefit from the ImGui Test Engine (high-value upgrades)

These are tests where the current Hook API cannot express the interaction being tested, or where time.sleep makes them fragile. The test engine's ctx.dock_into, ctx.window_focus, ctx.window_resize, ctx.item_click, ctx.capture_screenshot_window would replace the current Puppeteer-style approach.

| Test file | What it tests | Test engine primitive that upgrades it |
|---|---|---|
| test_workspace_profiles_sim.py | Save/restore docking layout via show_windows dict + save_workspace_profile callback | ctx.dock_into + ctx.window_focus + ctx.capture_screenshot_window for visual regression |
| test_auto_switch_sim.py | Auto-switch workspace profile based on MMA tier | ctx.dock_into + ctx.window_focus + state assertion via ctx.item_info |
| test_task_dag_popout_sim.py | Pop-out panel to standalone viewport | ctx.window_focus + ctx.window_info (is it docked or floating?) |
| test_usage_analytics_popout_sim.py | Pop-out usage analytics panel | Same as above |
| test_preset_windows_layout.py | Preset window layout restoration | ctx.dock_into + ctx.capture_screenshot_window |
| test_gui_text_viewer.py | Text viewer rendering + docking | ctx.window_focus + ctx.scroll_to_item + ctx.capture_screenshot_window |
| test_gui_context_presets.py | Context preset panel interactions | ctx.item_click + ctx.item_check |
| test_tool_management_layout.py | Tool management panel layout | ctx.item_click + ctx.window_info |
| test_selectable_ui.py | Selectable text rendering | ctx.item_click + ctx.item_info (is the item selectable?) |
| test_command_palette_sim.py | Command palette open + search + select | ctx.key_chars + ctx.item_click + ctx.key_press (Enter) |
| test_undo_redo_sim.py | Undo/redo workspace state | ctx.key_press (Ctrl+Z, Ctrl+Y) + state assertion |
| test_mma_step_mode_sim.py | MMA step mode approval flow | ctx.item_click (approve button) + ctx.item_info (is it enabled?) |
| test_mma_concurrent_tracks_sim.py | Concurrent track execution UI | ctx.item_click + ctx.window_info (stream visibility) |
| test_visual_mma.py | MMA visual dashboard | ctx.capture_screenshot_window + ctx.item_click |
| test_visual_orchestration.py | Visual orchestration panel | ctx.capture_screenshot_window |
| test_visual_sim_gui_ux.py | GUI UX event routing + performance | ctx.item_click + ctx.capture_screenshot_window |
| test_visual_sim_mma_v2.py | MMA visual simulation v2 | ctx.capture_screenshot_window |
| test_z_negative_flows.py | Negative flow handling | ctx.item_click + ctx.item_info (error state) |
| test_live_markdown_render.py | Live markdown rendering | ctx.capture_screenshot_window + ctx.scroll_to_item |
| test_live_workflow.py | Live workflow end-to-end | ctx.item_click + ctx.key_chars + ctx.capture_screenshot_window |
| test_reset_session_clears_mma_and_rag.py | Session reset clears state | ctx.item_click (reset button) + state assertion |
| test_saved_presets_sim.py | Preset save/load | ctx.item_click + ctx.item_info |
| test_system_prompt_sim.py | System prompt switching | ctx.item_click + ctx.item_info |
| test_gui_stress_performance.py | Stress performance | ctx.capture_screenshot_window (visual regression under load) |
| test_gui_performance_requirements.py | Performance requirements | ctx.capture_screenshot_window + FPS check |
| test_gui_startup_smoke.py | Startup smoke test | ctx.window_info (is the main window visible?) |
| test_hooks.py | Hook API integration | ctx.item_click + state assertion (replace time.sleep polling) |

Total: 27 live_gui tests are high-value upgrade candidates. That's 47% of the 58 live_gui tests.

3.2 Tests that are fine with the current Hook API (low-value upgrades)

These tests use the Hook API for state mutation + assertion, which the test engine doesn't improve on:

| Test pattern | Count | Why the Hook API is sufficient |
|---|---|---|
| Provider tests (gemini, deepseek, minimax, grok, qwen, llama) | ~8 | These test API responses, not UI |
| API hook endpoint tests | ~6 | These test the HTTP endpoints themselves |
| MMA model/logic tests | ~10 | These test data structures, not rendering |
| RAG engine tests | ~5 | These test the search engine, not UI |
| Import/audit tests | ~15 | These are AST/import checks, no UI |

Total: ~44 tests are fine as-is. The test engine adds no value for pure-logic or pure-API tests.

3.3 Tests that become possible ONLY with the test engine (new capabilities)

These interactions are impossible with the current Hook API and would be new tests enabled by the test engine:

| Capability | Test engine primitive | What it enables |
|---|---|---|
| Drag-and-drop docking | ctx.dock_into(src, dst, dir) | Test that dragging the Context Hub into the Session Hub docks correctly |
| Window focus order | ctx.window_focus(ref) | Test that clicking a panel brings it to front |
| Window resize | ctx.window_resize(ref, sz) | Test that resizing a panel doesn't break rendering |
| Keyboard shortcuts | ctx.key_press(KeyChord) | Test Ctrl+Z/Ctrl+Y, Ctrl+Shift+P (command palette), etc. |
| Tab close | ctx.tab_close(ref) | Test that closing a discussion tab removes it |
| Table column resize | ctx.table_resize_column(ref, col, width) | Test that resizing a table column persists |
| Screenshot diff | ctx.capture_screenshot_window(ref) | Visual regression: compare rendering across commits |
| Item hover | ctx.item_hold(ref, time) | Test tooltip behavior |
| Multi-step input | ctx.key_chars("text") + ctx.key_press(Enter) | Test form input flows |
| Tree open/close | ctx.item_open_all(ref) | Test tree expansion behavior |

4. Proposed Ordering Taxonomy (Assertion-Criticality-Based)

The problem with the current ordering

The current 2-level sort (fixture class → batch group) has no notion of what a test actually asserts. A test that verifies "the app starts" runs in the same batch as a test that verifies "the MMA DAG engine handles transitive blocking." If the former fails, the latter's result is meaningless (the app is broken, so everything downstream is suspect). But the batch runner reports them as independent pass/fail.

Proposed taxonomy: 3 dimensions

Dimension 1: Assertion criticality (the new ordering key)

| Level | Name | Description | Examples |
|---|---|---|---|
| C0 | Smoke | "Does the app start and respond?" | Hook server health, GUI startup, basic import |
| C1 | Structural | "Do the core subsystems exist and have the right shape?" | Dataclass field checks, type alias existence, model import |
| C2 | Behavioral | "Do the core subsystems behave correctly in isolation?" | DAG engine cycle detection, history push/undo, error handling Result[T] |
| C3 | Integration | "Do subsystems compose correctly?" | Hook API → controller → GUI task dispatch, AI client → provider dispatch |
| C4 | UI/Visual | "Does the GUI render correctly and respond to user input?" | Docking, focus, panel visibility, command palette, undo/redo |
| C5 | Stress/Perf | "Does it hold up under load?" | Concurrent tracks, stress performance, batch resilience |

Dimension 2: Fixture class (the existing tiering — retained)

| Fixture | Current tier | Maps to |
|---|---|---|
| unit | 1 | C1 + C2 (mostly) |
| mock_app | 2 | C2 + C3 (mostly) |
| live_gui | 3 | C0 + C3 + C4 + C5 (mixed!) |
| headless | H | C3 |
| performance | P | C5 |
| opt_in | 0 | C0 (clean install) |

Key insight: the live_gui tier (58 tests) is currently ordered as one undifferentiated tier that mixes C0 (smoke), C3 (integration), C4 (UI/visual), and C5 (stress) tests. Splitting it by criticality would allow:

  • Running C0 smoke first (fast fail if the app is broken)
  • Running C4 UI tests with the test engine (slower but high-fidelity)
  • Running C5 stress tests last (only if C0-C4 pass)
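
The gating implied by these bullets can be sketched as a staged run that stops executing once a stage fails. This is an illustration only: `run_stages` and the lambda stages are hypothetical, and the real runner would invoke pytest per batch rather than a callable.

```python
from typing import Callable


def run_stages(stages: list[tuple[str, Callable[[], bool]]]) -> list[tuple[str, str]]:
    """Run stages in order; after the first failure, mark the rest as skipped."""
    report: list[tuple[str, str]] = []
    failed = False
    for name, run in stages:
        if failed:
            report.append((name, "skipped"))
            continue
        ok = run()
        report.append((name, "passed" if ok else "failed"))
        failed = not ok
    return report


# Example: the integration stage fails, so the UI and stress stages never run.
report = run_stages(
    [
        ("C0-smoke", lambda: True),
        ("C3-integration", lambda: False),
        ("C4-ui", lambda: True),
        ("C5-stress", lambda: True),
    ]
)
for name, status in report:
    print(f"{name}: {status}")
```

Reporting later stages as skipped rather than failed keeps the output honest: their results would be meaningless once an earlier, more fundamental stage is broken.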

Dimension 3: Subsystem (the existing batch group — retained)

The core/gui/mma/comms/headless grouping stays. It's useful for targeted runs ("just run the gui tests").

Proposed ordering: (criticality, fixture, subsystem)

C0-smoke (any fixture)       ← run first; fast fail
  C1-structural (unit)        ← run second; cheap
    C2-behavioral (unit)       ← run third; still cheap
      C2-behavioral (mock_app) ← run fourth; mocked integration
        C3-integration (mock_app) ← run fifth
          C3-integration (live_gui) ← run sixth; real subprocess
            C4-ui (live_gui + test_engine) ← run seventh; high-fidelity
              C5-stress (live_gui) ← run last; only if C0-C4 pass
                C5-perf (performance) ← run last; opt-in

How this maps to the batched runner

The categorizer.py would gain a criticality: Criticality field on CategoryRecord. The batcher.py would sort by (criticality, fixture_class, batch_group) instead of (fixture_class, batch_group). The test_categories.toml registry would allow manual override of criticality for specific tests.
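
A rough sketch of that change follows. The actual fields of CategoryRecord are not shown in this report, so the dataclass below is a hypothetical reduction with only the attributes needed for sorting; the `Criticality` member names, `FIXTURE_ORDER`, `proposed_sort_key`, and the example file names are likewise assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Criticality(IntEnum):
    C0_SMOKE = 0
    C1_STRUCTURAL = 1
    C2_BEHAVIORAL = 2
    C3_INTEGRATION = 3
    C4_UI = 4
    C5_STRESS = 5


# Existing tier sequence (0, 1, 2, 3, H, P) expressed by fixture class.
FIXTURE_ORDER = {
    "opt_in": 0, "unit": 1, "mock_app": 2,
    "live_gui": 3, "headless": 4, "performance": 5,
}


@dataclass(frozen=True)
class CategoryRecord:
    """Hypothetical, reduced record; the real class has its own fields."""
    path: str
    fixture_class: str
    batch_group: str
    criticality: Criticality = Criticality.C2_BEHAVIORAL  # proposed default


def proposed_sort_key(rec: CategoryRecord) -> tuple[int, int, str]:
    """Sort by (criticality, fixture_class, batch_group)."""
    return (int(rec.criticality), FIXTURE_ORDER[rec.fixture_class], rec.batch_group)


records = [
    CategoryRecord("test_stress_example.py", "live_gui", "gui", Criticality.C5_STRESS),
    CategoryRecord("test_logic_example.py", "unit", "core"),
    CategoryRecord("test_logic_mocked_example.py", "mock_app", "core"),
    CategoryRecord("test_smoke_example.py", "live_gui", "gui", Criticality.C0_SMOKE),
    CategoryRecord("test_shape_example.py", "unit", "core", Criticality.C1_STRUCTURAL),
]
ordered_paths = [r.path for r in sorted(records, key=proposed_sort_key)]
print(ordered_paths)
```

Using an IntEnum keeps the levels comparable, so the only change to the existing sort is one extra leading element in the key.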

The run_tests_batched.py --plan output would show the criticality level:

[RUN] C0-smoke-any:      3 files, est 2s     ← hook health, GUI startup
[RUN] C1-structural-core: 45 files, est 22s  ← dataclass + type checks
[RUN] C2-behavioral-core:  120 files, est 60s ← logic tests
[RUN] C2-behavioral-gui:   15 files, est 8s
[RUN] C3-integration-mma:  12 files, est 15s
[RUN] C4-ui-gui:           27 files, est 40s  ← test engine tests (post-integration)
[RUN] C5-stress-gui:       5 files, est 20s

Migration path

  1. Phase 1: Add the criticality field to CategoryRecord + auto-inference rules. Default all existing tests to C2 (behavioral) — the current median. Manual overrides in test_categories.toml for C0, C1, C3, C4, C5 tests.
  2. Phase 2: Update batcher.py to sort by (criticality, fixture_class, batch_group).
  3. Phase 3: Curate the criticality assignments — audit each test, assign the correct level. This is the bulk of the work; can be done incrementally.
  4. Phase 4 (post-test-engine): Re-classify the 27 test-engine-upgrade candidates as C4-ui. The test engine enables higher-fidelity assertions for these.
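
Phase 1 could resolve a test's level roughly as sketched below: manual override first, then an inference rule, then the C2 default. The rule keywords and `resolve_criticality` are hypothetical, and the `overrides` dict stands in for whatever test_categories.toml would contain once parsed; its schema is not defined in this report.

```python
# Hypothetical auto-inference rules: (level, filename keywords).
INFERENCE_RULES: list[tuple[str, tuple[str, ...]]] = [
    ("C0", ("smoke",)),
    ("C5", ("stress", "perf")),
]

DEFAULT_LEVEL = "C2"  # proposed default for all existing tests


def resolve_criticality(filename: str, overrides: dict[str, str]) -> str:
    """Manual override wins; otherwise first matching rule; otherwise C2."""
    if filename in overrides:
        return overrides[filename]
    for level, keywords in INFERENCE_RULES:
        if any(keyword in filename for keyword in keywords):
            return level
    return DEFAULT_LEVEL


# Stand-in for overrides loaded from test_categories.toml (assumed shape).
overrides = {"test_undo_redo_sim.py": "C4"}

levels = {
    name: resolve_criticality(name, overrides)
    for name in [
        "test_gui_startup_smoke.py",
        "test_gui_stress_performance.py",
        "test_undo_redo_sim.py",
        "test_history.py",
    ]
}
print(levels)
```

Because unmatched files fall back to C2, Phase 3 curation can proceed incrementally: each audited test either gains an override or confirms its inferred level.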

5. Summary

| Category | Count | Action |
|---|---|---|
| Total test files | 393 | |
| Skip markers | 6 (4 same root cause) | Mock Gemini API in summarize.summarise_file → eliminates 4 skips |
| time.sleep users | 60 (38 live_gui) | Replace with poll loops (Hook API) or ctx.wait_for_test_results (test engine) |
| One-shot phase tests (cruft) | ~12-14 | Delete or consolidate into regression suites |
| Redundant clusters | 3 clusters (history: 5, theme: 6, markdown: 5) | Audit for overlap; consolidate |
| Test engine upgrade candidates | 27 live_gui tests | Migrate after test_engine_integration_20260627 ships |
| Tests fine as-is | ~44 live_gui + all unit/mock_app | No change needed |
| New tests enabled by test engine | ~10 capabilities | Docking, focus, resize, keyboard, screenshots |
| core batch (245 files, 62%) | 245 | Split by criticality for targeted verification |

Recommended track sequence:

  1. test_engine_integration_20260627 (initialized): build the bridge
  2. test_suite_cruft_cleanup_<date> (new): delete one-shot cruft, fix Gemini 503 skips, consolidate redundant clusters, replace time.sleep with poll loops
  3. test_ordering_taxonomy_<date> (new): add the criticality dimension to the batched runner
  4. test_engine_migration_<date> (Campaign A Track 2): migrate the 27 high-value live_gui tests to the test engine