ed 2b392b1f76 docs(audit): test suite analysis — cruft, test engine opportunities, ordering taxonomy

Comprehensive audit of 393 test files + the run_tests_batched runner.
Findings:
- 6 skip markers (4 same root cause: Gemini 503 in summarize.summarise_file)
- 60 files use time.sleep (38 live_gui — the banned anti-pattern)
- ~12-14 one-shot phase tests are cruft (verifying completed phases)
- 3 redundant test clusters (history: 5 files, theme: 6, markdown: 5)
- 27 live_gui tests are high-value test engine upgrade candidates
- ~44 live_gui tests are fine with the current Hook API
- ~10 new test capabilities enabled by the test engine (docking, focus, resize, keyboard, screenshots)
- The core batch is 245 files (62% of suite) — needs criticality-based splitting

Proposes a 3-dimension ordering taxonomy: (criticality, fixture, subsystem)
with 6 criticality levels (C0-smoke through C5-stress). The live_gui tier
mixes C0/C3/C4/C5 — splitting by criticality enables fast-fail + targeted
verification.

Recommends 4-track sequence: test_engine_integration → cruft_cleanup →
ordering_taxonomy → test_engine_migration.

2026-06-27 16:00:35 -04:00

Test Suite Audit: Cruft, Test Engine Opportunities, Ordering Taxonomy

Date: 2026-06-27
Branch: tier2/post_module_taxonomy_de_cruft_20260627 at 60f4c67e
Scope: 393 test files in tests/ + the run_tests_batched.py runner + categorizer.py + batcher.py + test_categories.toml

1. Current Test Suite Inventory

By fixture class (the current tiering dimension)

Fixture class	Tier	Count	Description
`unit`	1	296	Pure unit tests; no fixtures or lightweight mocks
`mock_app`	2	35	Uses the `mock_app` fixture (mocked App/AppController)
`live_gui`	3	58	Session-scoped real GUI subprocess via Hook API
`opt_in`	0	2	Clean install + docker build (env-var gated)
`performance`	P	4	Perf/stress tests
`headless`	H	1	Headless mode test
Total		396	(393 unique `test_*.py` + 3 `_sim.py`/`_e2e.py` counted separately)

By batch group (the sub-tier grouping)

Batch group	Tier-1 (unit)	Tier-2 (mock_app)	Tier-3 (live_gui)	Total
`core`	245	16	7	268
`gui`	21	9	28	58
`mma`	21	7	14	42
`comms`	7	2	0	9
`headless`	2	1	0	3

The core batch group at tier-1 is 245 files — 62% of the entire suite in a single xdist batch. This is the largest single batch and the primary bottleneck for targeted verification.

Current ordering mechanism

The runner uses a 2-level sort:

1. Fixture class (the tier): 0 → 1 → 2 → 3 → H → P
2. Batch group (within each tier): alphabetical (comms → core → gui → headless → mma)

Within each batch, pytest's default collection order applies: files in the order they are passed or discovered (alphabetical within a directory), then tests in definition order within each file. There is no assertion-criticality ordering: a test that asserts "the GUI started" runs in the same batch as a test that asserts "the MMA DAG engine correctly handles transitive blocking propagation."
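A minimal sketch of the 2-level sort described above, assuming each test file is represented by a small record carrying its tier and batch group. The `FileRecord` type, the `TIER_ORDER` mapping, and the file names are hypothetical and only illustrate the ordering; they are not the runner's actual data model.

```python
from dataclasses import dataclass

# Tier order from the runner description: 0 -> 1 -> 2 -> 3 -> H -> P.
TIER_ORDER = {"0": 0, "1": 1, "2": 2, "3": 3, "H": 4, "P": 5}


@dataclass(frozen=True)
class FileRecord:  # hypothetical record type
    name: str
    tier: str
    batch_group: str


def current_sort_key(rec: FileRecord) -> tuple[int, str]:
    # Level 1: fixture tier. Level 2: batch group, alphabetical.
    return (TIER_ORDER[rec.tier], rec.batch_group)


records = [
    FileRecord("test_d.py", "3", "mma"),
    FileRecord("test_b.py", "1", "core"),
    FileRecord("test_c.py", "3", "gui"),
    FileRecord("test_a.py", "1", "comms"),
]
ordered = [rec.name for rec in sorted(records, key=current_sort_key)]
```

Nothing in this key reflects what a test asserts, which is the gap the proposed taxonomy addresses.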

2. Cruft Inventory

2.1 Skip markers (6 sites)

File	Line	Reason	Status
`test_aggregate_flags.py`	5	Gemini 503 flake in `summarize.summarise_file`	Documented; deferred to follow-up
`test_context_composition_phase6.py`	5	Gemini 503 flake (same pattern)	Documented; deferred
`test_context_composition_phase6.py`	81	Gemini 503 flake (same pattern)	Documented; deferred
`test_context_composition_phase6.py`	153	Gemini 503 flake (same pattern)	Documented; deferred
`test_mma_step_mode_sim.py`	24	`@pytest.mark.skipif` (env-gated)	Legitimate opt-in
`test_test_sandbox.py`	417	`@pytest.mark.skipif(os.name != "nt")`	Legitimate platform gate

Assessment: 4 of 6 are the same root cause (Gemini 503 in summarize.summarise_file). The fix is mocking the Gemini API call — a single track eliminates all 4 skips. The 2 skipif markers are legitimate (env-gated opt-in + platform gate).
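One way the mocking fix could look, sketched with `unittest.mock`. The `summarize` class below is a stand-in for the real module, and `_call_gemini` is an assumed name for whatever internal function performs the network call; the real seam may differ.

```python
from unittest import mock


class GeminiUnavailable(Exception):
    """Stand-in for a provider-side 503 error."""


class summarize:  # stand-in for the real module; internals are assumptions
    @staticmethod
    def _call_gemini(prompt: str) -> str:
        # Simulates the flaky upstream call that triggers the skips.
        raise GeminiUnavailable("503 Service Unavailable")

    @staticmethod
    def summarise_file(path: str, text: str) -> str:
        return summarize._call_gemini(f"Summarise {path}:\n{text}")


def summarise_with_stub() -> str:
    # Patch only the network seam so summarise_file itself still runs.
    with mock.patch.object(summarize, "_call_gemini", return_value="stub summary") as fake:
        result = summarize.summarise_file("a.py", "print('hi')")
    fake.assert_called_once()
    return result
```

Patching at the network boundary rather than patching `summarise_file` itself keeps the code under test in the loop, so the four skipped tests could assert on real behavior without depending on Gemini availability.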

2.2 `time.sleep` usage (60 files — 15% of the suite)

60 test files use time.sleep, which workflow.md explicitly bans under "Anti-Pattern: push_event + time.sleep(N) + assert." The distribution:

Category	Count	Risk
`live_gui` tests with `time.sleep`	38	High — guaranteed race in batched runs
`mock_app` tests with `time.sleep`	5	Medium — mocked, but still fragile
`unit` tests with `time.sleep`	12	Low — usually in setup/teardown, not assertions
`performance` tests with `time.sleep`	2	Low — intentional for perf measurement
`opt_in` tests with `time.sleep`	2	Low — gated
`headless` tests with `time.sleep`	1	Low

The 38 live_gui tests with time.sleep are the primary cruft. Each one is a latent race condition. The test engine's wait_for_test_results(timeout) + ctx.item_click (which blocks until the action completes) would replace these.
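Until the test engine is available, a bounded poll loop is the usual replacement for a fixed sleep. The helper below is a generic sketch; `wait_until` and its parameters are illustrative names, not part of the Hook API.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> bool:
    """Poll predicate until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        # Short, bounded pause between checks; total wait is capped by timeout.
        time.sleep(interval)
```

A test would then write `assert wait_until(lambda: state_is_ready())` (with `state_is_ready` standing for whatever Hook API query applies) instead of `time.sleep(N)` followed by an assertion. The test finishes as soon as the condition holds and fails deterministically at the timeout rather than racing.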

2.3 One-shot verification tests (likely cruft)

These tests were written to verify that a specific phase shipped. The phase is long since complete, but the test still runs in every batch:

File	What it verified	Still relevant?
`test_phase_3_final_verify.py`	Phase 3 final verification	No — phase shipped months ago
`test_rag_phase4_final_verify.py`	RAG Phase 4 final verification	No — phase shipped
`test_rag_phase4_stress.py`	RAG Phase 4 stress test	Maybe — stress tests have ongoing value
`test_arch_boundary_phase1.py`	Arch boundary phase 1	No — superseded by phase 2/3
`test_arch_boundary_phase2.py`	Arch boundary phase 2	Maybe — regression guard
`test_arch_boundary_phase3.py`	Arch boundary phase 3	Maybe — regression guard
`test_code_path_audit_phase78.py`	Code path audit phases 7-8	No — audit closed
`test_code_path_audit_phase89.py`	Code path audit phases 8-9	No — audit closed
`test_context_composition_phase3.py`	Context composition phase 3	No — superseded
`test_context_composition_phase4.py`	Context composition phase 4	No — superseded
`test_context_composition_phase6.py`	Context composition phase 6	No — superseded (3 of 4 tests skipped anyway)
`test_gui_phase3.py`	GUI phase 3	No — superseded
`test_gui_phase4.py`	GUI phase 4	No — superseded
`test_metadata_promotion_phase1.py`	Metadata promotion phase 1	Maybe — regression guard
`test_mma_agent_focus_phase1.py`	MMA agent focus phase 1	No — superseded by phase 3
`test_mma_agent_focus_phase3.py`	MMA agent focus phase 3	Maybe — regression guard
`test_phase6_engine.py`	Phase 6 engine	Maybe — regression guard
`test_phase6_simulation.py`	Phase 6 simulation	Maybe — regression guard
`test_fixes_20260517.py`	Fixes from May 17	No — one-shot fix verification
`test_project_context_20260627.py`	Project context (dated)	Maybe — recent enough to keep

Assessment: ~12-14 of 20 one-shot phase tests are cruft. The ones marked "Maybe" are regression guards for features that could still break; the "No" ones are verifying completed phases that won't regress unless someone reverts the feature.
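A filename heuristic can surface future one-shot candidates for review. The sketch below assumes the naming conventions seen in the table (a `phase` token followed by digits, or an 8-digit date stamp); the pattern is an illustration and only flags files for a human decision, it does not decide what to delete.

```python
import re

# "phase" followed by digits (optionally after an underscore),
# or an 8-digit date stamp followed by "_" or ".".
ONE_SHOT_PATTERN = re.compile(r"phase_?\d+|_\d{8}(?=[_.])")


def is_one_shot_candidate(filename: str) -> bool:
    return ONE_SHOT_PATTERN.search(filename) is not None


names = [
    "test_phase_3_final_verify.py",
    "test_rag_phase4_stress.py",
    "test_fixes_20260517.py",
    "test_history.py",
]
candidates = [name for name in names if is_one_shot_candidate(name)]
```

Files marked "Maybe" in the table would still match, so the output is a review list rather than a deletion list.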

2.4 Potentially redundant test clusters

Cluster	Files	Issue
History	`test_history.py`, `test_history_management.py`, `test_history_manager.py`, `test_history_message.py`, `test_orchestrator_pm_history.py`	5 files for history; likely overlapping coverage
Theme	`test_theme.py`, `test_theme_2_no_top_level_nerv.py`, `test_theme_models.py`, `test_theme_nerv.py`, `test_theme_nerv_alert.py`, `test_theme_nerv_fx.py`	6 files for theme; some are import-tests, some are functional
Markdown table	`test_markdown_table.py`, `test_markdown_table_columns.py`, `test_markdown_table_render.py`, `test_markdown_table_wrapped.py`, `test_markdown_helper_no_top_level_table.py`	5 files for markdown tables; likely overlapping
Audit scripts	10 `test_audit_*.py` files	Each tests a different audit script; not redundant, but heavy for "test the tests"

2.5 Import-only tests (structural but low value)

10+ test_*_no_top_level_*.py files test that specific modules don't import heavy dependencies at module level. These were critical during the startup_speedup campaign but are now regression guards. They're cheap to run (unit tier) but add to the 245-file core batch.
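For context, an import-only check of this kind can be written with the standard `ast` module, without importing the module under test. This is a generic sketch, not the code in the existing `test_*_no_top_level_*.py` files; `top_level_imports` is a hypothetical helper.

```python
import ast


def top_level_imports(source: str) -> set[str]:
    """Return root package names imported directly at module level."""
    roots: set[str] = set()
    for node in ast.parse(source).body:  # direct children of the module only
        if isinstance(node, ast.Import):
            roots.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            roots.add(node.module.split(".")[0])
    return roots


SAMPLE = '''
import os
from json import dumps

def render():
    import numpy  # deferred import: not executed at module load
    return numpy.zeros(1)
'''

found = top_level_imports(SAMPLE)
```

This version ignores imports nested in top-level `if` or `try` blocks, so a stricter guard would need to walk those as well.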

3. Test Engine Upgrade Opportunities

3.1 Tests that would benefit from the ImGui Test Engine (high-value upgrades)

These are tests where the current Hook API cannot express the interaction being tested, or where time.sleep makes them fragile. The test engine's ctx.dock_into, ctx.window_focus, ctx.window_resize, ctx.item_click, ctx.capture_screenshot_window would replace the current Puppeteer-style approach.

Test file	What it tests	Test engine primitive that upgrades it
`test_workspace_profiles_sim.py`	Save/restore docking layout via `show_windows` dict + `save_workspace_profile` callback	`ctx.dock_into` + `ctx.window_focus` + `ctx.capture_screenshot_window` for visual regression
`test_auto_switch_sim.py`	Auto-switch workspace profile based on MMA tier	`ctx.dock_into` + `ctx.window_focus` + state assertion via `ctx.item_info`
`test_task_dag_popout_sim.py`	Pop-out panel to standalone viewport	`ctx.window_focus` + `ctx.window_info` (is it docked or floating?)
`test_usage_analytics_popout_sim.py`	Pop-out usage analytics panel	Same as above
`test_preset_windows_layout.py`	Preset window layout restoration	`ctx.dock_into` + `ctx.capture_screenshot_window`
`test_gui_text_viewer.py`	Text viewer rendering + docking	`ctx.window_focus` + `ctx.scroll_to_item` + `ctx.capture_screenshot_window`
`test_gui_context_presets.py`	Context preset panel interactions	`ctx.item_click` + `ctx.item_check`
`test_tool_management_layout.py`	Tool management panel layout	`ctx.item_click` + `ctx.window_info`
`test_selectable_ui.py`	Selectable text rendering	`ctx.item_click` + `ctx.item_info` (is the item selectable?)
`test_command_palette_sim.py`	Command palette open + search + select	`ctx.key_chars` + `ctx.item_click` + `ctx.key_press` (Enter)
`test_undo_redo_sim.py`	Undo/redo workspace state	`ctx.key_press` (Ctrl+Z, Ctrl+Y) + state assertion
`test_mma_step_mode_sim.py`	MMA step mode approval flow	`ctx.item_click` (approve button) + `ctx.item_info` (is it enabled?)
`test_mma_concurrent_tracks_sim.py`	Concurrent track execution UI	`ctx.item_click` + `ctx.window_info` (stream visibility)
`test_visual_mma.py`	MMA visual dashboard	`ctx.capture_screenshot_window` + `ctx.item_click`
`test_visual_orchestration.py`	Visual orchestration panel	`ctx.capture_screenshot_window`
`test_visual_sim_gui_ux.py`	GUI UX event routing + performance	`ctx.item_click` + `ctx.capture_screenshot_window`
`test_visual_sim_mma_v2.py`	MMA visual simulation v2	`ctx.capture_screenshot_window`
`test_z_negative_flows.py`	Negative flow handling	`ctx.item_click` + `ctx.item_info` (error state)
`test_live_markdown_render.py`	Live markdown rendering	`ctx.capture_screenshot_window` + `ctx.scroll_to_item`
`test_live_workflow.py`	Live workflow end-to-end	`ctx.item_click` + `ctx.key_chars` + `ctx.capture_screenshot_window`
`test_reset_session_clears_mma_and_rag.py`	Session reset clears state	`ctx.item_click` (reset button) + state assertion
`test_saved_presets_sim.py`	Preset save/load	`ctx.item_click` + `ctx.item_info`
`test_system_prompt_sim.py`	System prompt switching	`ctx.item_click` + `ctx.item_info`
`test_gui_stress_performance.py`	Stress performance	`ctx.capture_screenshot_window` (visual regression under load)
`test_gui_performance_requirements.py`	Performance requirements	`ctx.capture_screenshot_window` + FPS check
`test_gui_startup_smoke.py`	Startup smoke test	`ctx.window_info` (is the main window visible?)
`test_hooks.py`	Hook API integration	`ctx.item_click` + state assertion (replace `time.sleep` polling)

Total: 27 live_gui tests are high-value upgrade candidates. That's 47% of the 58 live_gui tests.

3.2 Tests that are fine with the current Hook API (low-value upgrades)

These tests use the Hook API for state mutation + assertion, which the test engine doesn't improve on:

Test pattern	Count	Why the Hook API is sufficient
Provider tests (gemini, deepseek, minimax, grok, qwen, llama)	~8	These test API responses, not UI
API hook endpoint tests	~6	These test the HTTP endpoints themselves
MMA model/logic tests	~10	These test data structures, not rendering
RAG engine tests	~5	These test the search engine, not UI
Import/audit tests	~15	These are AST/import checks, no UI

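Total: ~44 tests are fine as-is. The test engine adds no value for pure-logic or pure-API tests. Note that this count cannot be live_gui-only: 27 upgrade candidates plus 44 exceeds the 58 live_gui files, so the categories above span other fixture classes as well.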

3.3 Tests that become possible ONLY with the test engine (new capabilities)

These interactions are impossible with the current Hook API and would be new tests enabled by the test engine:

Capability	Test engine primitive	What it enables
Drag-and-drop docking	`ctx.dock_into(src, dst, dir)`	Test that dragging the Context Hub into the Session Hub docks correctly
Window focus order	`ctx.window_focus(ref)`	Test that clicking a panel brings it to front
Window resize	`ctx.window_resize(ref, sz)`	Test that resizing a panel doesn't break rendering
Keyboard shortcuts	`ctx.key_press(KeyChord)`	Test Ctrl+Z/Ctrl+Y, Ctrl+Shift+P (command palette), etc.
Tab close	`ctx.tab_close(ref)`	Test that closing a discussion tab removes it
Table column resize	`ctx.table_resize_column(ref, col, width)`	Test that resizing a table column persists
Screenshot diff	`ctx.capture_screenshot_window(ref)`	Visual regression: compare rendering across commits
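Item hold / hover	`ctx.item_hold(ref, time)`	Test press-and-hold behavior; tooltip tests may need a mouse-move primitive instead, since `item_hold` keeps the mouse button pressed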
Multi-step input	`ctx.key_chars("text")` + `ctx.key_press(Enter)`	Test form input flows
Tree open/close	`ctx.item_open_all(ref)`	Test tree expansion behavior

4. Proposed Ordering Taxonomy (Assertion-Criticality-Based)

The problem with the current ordering

The current 2-level sort (fixture class → batch group) has no notion of what a test actually asserts. A test that verifies "the app starts" runs in the same batch as a test that verifies "the MMA DAG engine handles transitive blocking." If the former fails, the latter's result is meaningless (the app is broken, so everything downstream is suspect). But the batch runner reports them as independent pass/fail.

Proposed taxonomy: 3 dimensions

Dimension 1: Assertion criticality (the new ordering key)

Level	Name	Description	Examples
C0	Smoke	"Does the app start and respond?"	Hook server health, GUI startup, basic import
C1	Structural	"Do the core subsystems exist and have the right shape?"	Dataclass field checks, type alias existence, model import
C2	Behavioral	"Do the core subsystems behave correctly in isolation?"	DAG engine cycle detection, history push/undo, error handling Result[T]
C3	Integration	"Do subsystems compose correctly?"	Hook API → controller → GUI task dispatch, AI client → provider dispatch
C4	UI/Visual	"Does the GUI render correctly and respond to user input?"	Docking, focus, panel visibility, command palette, undo/redo
C5	Stress/Perf	"Does it hold up under load?"	Concurrent tracks, stress performance, batch resilience

Dimension 2: Fixture class (the existing tiering — retained)

Fixture	Current tier	Maps to
`unit`	1	C1 + C2 (mostly)
`mock_app`	2	C2 + C3 (mostly)
`live_gui`	3	C0 + C3 + C4 + C5 (mixed!)
`headless`	H	C3
`performance`	P	C5
`opt_in`	0	C0 (clean install)

Key insight: the live_gui tier (58 tests) currently mixes C0 (smoke), C3 (integration), C4 (UI/visual), and C5 (stress) tests with no ordering between them. Splitting it by criticality would allow:

- Running C0 smoke first (fast fail if the app is broken)
- Running C4 UI tests with the test engine (slower but high-fidelity)
- Running C5 stress tests last (only if C0-C4 pass)
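The fast-fail behavior described in this list could be sketched as a gated loop over criticality levels. The `run_gated` function and its `run_level` callback are hypothetical; stopping at the first failing level is one possible policy, stricter than gating only C5.

```python
from typing import Callable, Sequence

LEVELS = ("C0", "C1", "C2", "C3", "C4", "C5")


def run_gated(
    run_level: Callable[[str], bool],
    levels: Sequence[str] = LEVELS,
) -> dict[str, str]:
    """Run levels in order; after the first failure, mark the rest as skipped."""
    results: dict[str, str] = {}
    blocked = False
    for level in levels:
        if blocked:
            # An earlier level failed, so later results would be suspect.
            results[level] = "skipped"
            continue
        passed = run_level(level)
        results[level] = "passed" if passed else "failed"
        blocked = not passed
    return results
```

Reporting downstream levels as skipped rather than failed keeps the output honest: a broken smoke test yields one failure instead of a wall of dependent ones.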

Dimension 3: Subsystem (the existing batch group — retained)

The core/gui/mma/comms/headless grouping stays. It's useful for targeted runs ("just run the gui tests").

Proposed ordering: (criticality, fixture, subsystem)

C0-smoke (any fixture)       ← run first; fast fail
  C1-structural (unit)        ← run second; cheap
    C2-behavioral (unit)       ← run third; still cheap
      C2-behavioral (mock_app) ← run fourth; mocked integration
        C3-integration (mock_app) ← run fifth
          C3-integration (live_gui) ← run sixth; real subprocess
            C4-ui (live_gui + test_engine) ← run seventh; high-fidelity
              C5-stress (live_gui) ← run last; only if C0-C4 pass
                C5-perf (performance) ← run last; opt-in

How this maps to the batched runner

The categorizer.py would gain a criticality: Criticality field on CategoryRecord. The batcher.py would sort by (criticality, fixture_class, batch_group) instead of (fixture_class, batch_group). The test_categories.toml registry would allow manual override of criticality for specific tests.
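A sketch of what the extended record and sort key might look like. Only the `criticality` field and the `(criticality, fixture_class, batch_group)` ordering come from the proposal; the other field names, the enum member names, the `FIXTURE_ORDER` mapping, and the file names are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Criticality(IntEnum):
    C0_SMOKE = 0
    C1_STRUCTURAL = 1
    C2_BEHAVIORAL = 2
    C3_INTEGRATION = 3
    C4_UI = 4
    C5_STRESS = 5


# Mirrors the current tier order: 0 -> 1 -> 2 -> 3 -> H -> P.
FIXTURE_ORDER = {
    "opt_in": 0, "unit": 1, "mock_app": 2,
    "live_gui": 3, "headless": 4, "performance": 5,
}


@dataclass(frozen=True)
class CategoryRecord:  # fields other than `criticality` are assumed
    path: str
    fixture_class: str
    batch_group: str
    criticality: Criticality = Criticality.C2_BEHAVIORAL


def sort_key(rec: CategoryRecord) -> tuple[int, int, str]:
    return (int(rec.criticality), FIXTURE_ORDER[rec.fixture_class], rec.batch_group)


records = [
    CategoryRecord("test_stress_x.py", "live_gui", "gui", Criticality.C5_STRESS),
    CategoryRecord("test_logic_x.py", "unit", "core"),
    CategoryRecord("test_smoke_x.py", "live_gui", "gui", Criticality.C0_SMOKE),
    CategoryRecord("test_logic_y.py", "mock_app", "core"),
]
plan = [rec.path for rec in sorted(records, key=sort_key)]
```

Using an `IntEnum` keeps the levels directly comparable, so the batcher change reduces to swapping the sort key.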

The run_tests_batched.py --plan output would show the criticality level:

[RUN] C0-smoke-any:         3 files, est  2s  ← hook health, GUI startup
[RUN] C1-structural-core:  45 files, est 22s  ← dataclass + type checks
[RUN] C2-behavioral-core: 120 files, est 60s  ← logic tests
[RUN] C2-behavioral-gui:   15 files, est  8s
[RUN] C3-integration-mma:  12 files, est 15s
[RUN] C4-ui-gui:           27 files, est 40s  ← test engine tests (post-integration)
[RUN] C5-stress-gui:        5 files, est 20s

Migration path

- Phase 1: Add the criticality field to CategoryRecord plus auto-inference rules. Default all existing tests to C2 (behavioral), the current median. Manual overrides in test_categories.toml for C0, C1, C3, C4, C5 tests.
- Phase 2: Update batcher.py to sort by (criticality, fixture_class, batch_group).
- Phase 3: Curate the criticality assignments: audit each test and assign the correct level. This is the bulk of the work; it can be done incrementally.
- Phase 4 (post-test-engine): Re-classify the 27 test-engine-upgrade candidates as C4-ui. The test engine enables higher-fidelity assertions for these.
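Phase 1's "default to C2, override manually" rule could be sketched as below. The override mapping stands in for whatever test_categories.toml would parse to; its shape (filename to level label) and the two example assignments are assumptions, not the registry's actual schema.

```python
from enum import IntEnum


class Criticality(IntEnum):
    C0 = 0
    C1 = 1
    C2 = 2
    C3 = 3
    C4 = 4
    C5 = 5


DEFAULT_CRITICALITY = Criticality.C2  # Phase 1 default from the migration path


def resolve_criticality(filename: str, overrides: dict[str, str]) -> Criticality:
    """A manual override wins; everything else falls back to the default."""
    label = overrides.get(filename)
    if label is None:
        return DEFAULT_CRITICALITY
    try:
        return Criticality[label]
    except KeyError:
        raise ValueError(f"unknown criticality {label!r} for {filename}") from None


# Hypothetical overrides, as they might look once parsed from the registry.
overrides = {
    "test_gui_startup_smoke.py": "C0",
    "test_gui_stress_performance.py": "C5",
}
smoke = resolve_criticality("test_gui_startup_smoke.py", overrides)
fallback = resolve_criticality("test_history.py", overrides)
```

Rejecting unknown labels at load time means a typo in the registry fails loudly instead of silently demoting a test to the default level.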

5. Summary

Category	Count	Action
Total test files	393	—
Skip markers	6 (4 same root cause)	Mock Gemini API in `summarize.summarise_file` → eliminates 4 skips
`time.sleep` users	60 (38 live_gui)	Replace with poll loops (Hook API) or `ctx.wait_for_test_results` (test engine)
One-shot phase tests (cruft)	~12-14	Delete or consolidate into regression suites
Redundant clusters	3 clusters (history: 5, theme: 6, markdown: 5)	Audit for overlap; consolidate
Test engine upgrade candidates	27 live_gui tests	Migrate after `test_engine_integration_20260627` ships
Tests fine as-is	~44 live_gui + all unit/mock_app	No change needed
New tests enabled by test engine	~10 capabilities	Docking, focus, resize, keyboard, screenshots
`core` batch (245 files, 62%)	245	Split by criticality for targeted verification

Recommended track sequence

1. test_engine_integration_20260627 (initialized): build the bridge
2. test_suite_cruft_cleanup_<date> (new): delete one-shot cruft, fix Gemini 503 skips, consolidate redundant clusters, replace time.sleep with poll loops
3. test_ordering_taxonomy_<date> (new): add the criticality dimension to the batched runner
4. test_engine_migration_<date> (Campaign A Track 2): migrate the 27 high-value live_gui tests to the test engine
