diff --git a/conductor/tracks/test_batching_refactor_20260606/plan.md b/conductor/tracks/test_batching_refactor_20260606/plan.md new file mode 100644 index 00000000..8b68deee --- /dev/null +++ b/conductor/tracks/test_batching_refactor_20260606/plan.md @@ -0,0 +1,1756 @@ +# Test Batching Refactor — Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Replace alphabetical 4-at-a-time test batching in `scripts/run_tests_batched.py` with a fixture-class-isolated, subsystem-grouped, xdist-accelerated 6-tier model. Hybrid auto-infer + registry classification. Opt-in per-test order via conftest plugin. + +**Architecture:** Three new library modules under `scripts/`: `test_categorizer.py` (pure classifier), `test_batcher.py` (pure scheduler), `pytest_collection_order.py` (conftest-loaded plugin for opt-in per-test order). One CLI orchestrator replacing the current `run_tests_batched.py` (slim, delegates everything). One hand-curated registry `tests/test_categories.toml` for cross-cutting and ambiguous files. Process isolation enforced by spawning one pytest subprocess per `(tier, batch_group)` pair. + +**Tech Stack:** Python 3.11+ (stdlib `tomllib`, `re`, `pathlib`, `dataclasses`, `enum`, `ast`), pytest, pytest-xdist (optional, `--no-xdist` to skip), tomli-w (NOT needed; we only read). **1-space indentation mandatory.** No comments in production code. + +**Reference:** See `conductor/tracks/test_batching_refactor_20260606/spec.md` for the full design, data model, auto-inference rules, and CLI surface. + +--- + +## File Structure + +| File | Action | Responsibility | +|---|---|---| +| `scripts/test_categorizer.py` | Create | Pure classifier: enums, dataclass, `auto_classify`, `load_registry`, `merge_registry`, `categorize_all` | +| `scripts/test_batcher.py` | Create | Pure scheduler: `Batch` dataclass, `plan(records, options) -> list[Batch]` | +| `scripts/pytest_collection_order.py` | Create | Conftest plugin: `pytest_collection_modifyitems` hook sorts items by `order` index from registry | +| `scripts/run_tests_batched.py` | Modify (Phase 1: stub `--plan`/`--audit`; Phase 3: full CLI) | CLI orchestrator: parse args, call `categorize_all`, dispatch to `plan` + `subprocess.run` | +| `tests/test_categorizer.py` | Create | Unit tests for the categorizer (11+ tests) | +| `tests/test_batcher.py` | Create | Unit tests for the batcher (5+ tests) | +| `tests/test_pytest_collection_order.py` | Create | Unit tests for the plugin (2+ tests) | +| `tests/test_categories.toml` | Create (Phase 4 content) | Hand-curated registry; empty/optional in Phases 1-3 | +| `tests/conftest.py` | Modify | Register `scripts.pytest_collection_order` via `pytest_plugins` | +| `docs/guide_testing.md` | Modify (Phase 3) | Update "Running Tests" section to reference new script | +| `pyproject.toml` | Modify (Phase 4) | No structural change; verify markers list | +| `.gitignore` | Modify (Phase 4) | Add `tests/.test_durations.json` | +| `scripts/run_tests_batched.py.legacy` | Create (Phase 3) → Delete (Phase 4) | Old script preserved for one cycle | + +--- + +# Phase 1: Library + dry-run + +> Goal: All new library code exists with passing tests. The new `run_tests_batched.py` only has `--plan` and `--audit` modes (no actual test execution). The old `run_tests_batched.py` is untouched. The conftest plugin is wired and is a no-op (no `[[test_order]]` entries exist yet). + +--- + +## Task 1.1: Add data model types to scripts/test_categorizer.py + +**Files:** +- Create: `scripts/test_categorizer.py` + +- [ ] **Step 1: Create the file with enums and CategoryRecord dataclass** + +```python +from dataclasses import dataclass, field +from enum import Enum + +class FixtureClass(str, Enum): + UNIT = "unit" + MOCK_APP = "mock_app" + LIVE_GUI = "live_gui" + HEADLESS = "headless" + OPT_IN = "opt_in" + PERFORMANCE = "performance" + +class Speed(str, Enum): + FAST = "fast" + MEDIUM = "medium" + SLOW = "slow" + VERY_SLOW = "very_slow" + +@dataclass(frozen=True) +class CategoryRecord: + filename: str + fixture_class: FixtureClass + subsystems: list[str] + speed: Speed + batch_group: str + notes: str = "" + test_order: dict[str, int] = field(default_factory=dict) + source: str = "auto" + warnings: list[str] = field(default_factory=list) +``` + +- [ ] **Step 2: Verify the file is importable** + +Run: `uv run python -c "from scripts.test_categorizer import CategoryRecord, FixtureClass, Speed; print(CategoryRecord(filename='x', fixture_class=FixtureClass.UNIT, subsystems=['core'], speed=Speed.FAST, batch_group='core'))"` +Expected: prints a `CategoryRecord(filename='x', ...)` line with no errors. + +- [ ] **Step 3: Commit** + +```bash +git add scripts/test_categorizer.py +git commit -m "feat(scripts): add CategoryRecord data model for test categorization" +``` + +--- + +## Task 1.2: Write red tests for auto_classify fixture_class rules + +**Files:** +- Create: `tests/test_categorizer.py` + +- [ ] **Step 1: Create test file with 6 failing tests covering fixture_class auto-inference** + +```python +from pathlib import Path +import pytest +from scripts.test_categorizer import FixtureClass, Speed, auto_classify + +def _write(tmp_path: Path, name: str, content: str) -> Path: + p = tmp_path / name + p.write_text(content, encoding="utf-8") + return p + +def test_auto_classify_clean_install_filename(tmp_path: Path) -> None: + p = _write(tmp_path, "test_clean_install.py", "def test_x(): pass\n") + r = auto_classify(p) + assert r.fixture_class == FixtureClass.OPT_IN + +def test_auto_classify_docker_build_filename(tmp_path: Path) -> None: + p = _write(tmp_path, "test_docker_build.py", "def test_x(): pass\n") + r = auto_classify(p) + assert r.fixture_class == FixtureClass.OPT_IN + +def test_auto_classify_live_gui_fixture_in_source(tmp_path: Path) -> None: + p = _write(tmp_path, "test_x.py", "def test_x(live_gui): pass\n") + r = auto_classify(p) + assert r.fixture_class == FixtureClass.LIVE_GUI + +def test_auto_classify_mock_app_fixture_in_source(tmp_path: Path) -> None: + p = _write(tmp_path, "test_x.py", "def test_x(mock_app): pass\n") + r = auto_classify(p) + assert r.fixture_class == FixtureClass.MOCK_APP + +def test_auto_classify_perf_keyword_in_filename(tmp_path: Path) -> None: + p = _write(tmp_path, "test_xyz_stress.py", "def test_x(): pass\n") + r = auto_classify(p) + assert r.fixture_class == FixtureClass.PERFORMANCE + +def test_auto_classify_default_to_unit(tmp_path: Path) -> None: + p = _write(tmp_path, "test_command_palette.py", "def test_x(): pass\n") + r = auto_classify(p) + assert r.fixture_class == FixtureClass.UNIT +``` + +- [ ] **Step 2: Run the tests, confirm all 6 fail** + +Run: `uv run pytest tests/test_categorizer.py -v` +Expected: All 6 tests FAIL with `AttributeError: module 'scripts.test_categorizer' has no attribute 'auto_classify'`. + +- [ ] **Step 3: Commit (red)** + +```bash +git add tests/test_categorizer.py +git commit -m "test(categorizer): add red tests for auto_classify fixture_class rules" +``` + +--- + +## Task 1.3: Implement auto_classify fixture_class rules + +**Files:** +- Modify: `scripts/test_categorizer.py` + +- [ ] **Step 1: Add the auto_classify function with fixture_class rules** + +Append to `scripts/test_categorizer.py`: + +```python +import re +from pathlib import Path + +_OPT_IN_PATTERN = re.compile(r"^test_(clean_install|docker_build)") +_LIVE_GUI_PATTERN = re.compile(r"\(live_gui\)\s*[:,)]") +_MOCK_APP_PATTERN = re.compile(r"\b(mock_app|app_instance)\b") +_PERF_KEYWORDS = ("perf", "stress", "phase_3_final", "phase_4_stress") + +def _classify_fixture_class(path: Path, source: str) -> FixtureClass: + name = path.name + if _OPT_IN_PATTERN.match(name): + return FixtureClass.OPT_IN + if _LIVE_GUI_PATTERN.search(source): + return FixtureClass.LIVE_GUI + if _MOCK_APP_PATTERN.search(source): + return FixtureClass.MOCK_APP + lowered = name.lower() + for kw in _PERF_KEYWORDS: + if kw in lowered: + return FixtureClass.PERFORMANCE + return FixtureClass.UNIT + +def auto_classify(path: Path, durations: dict[str, float] | None = None) -> CategoryRecord: + source = path.read_text(encoding="utf-8", errors="replace") + fixture_class = _classify_fixture_class(path, source) + return CategoryRecord( + filename=path.name, + fixture_class=fixture_class, + subsystems=[], + speed=Speed.MEDIUM, + batch_group="", + source="auto", + ) +``` + +- [ ] **Step 2: Run the 6 tests, confirm they pass** + +Run: `uv run pytest tests/test_categorizer.py -v` +Expected: All 6 tests PASS. + +- [ ] **Step 3: Commit (green)** + +```bash +git add scripts/test_categorizer.py +git commit -m "feat(categorizer): implement auto_classify fixture_class rules" +``` + +--- + +## Task 1.4: Write red tests for subsystem, speed, and batch_group inference + +**Files:** +- Modify: `tests/test_categorizer.py` + +- [ ] **Step 1: Append 4 failing tests for the new inference functions** + +```python +def test_subsystem_inference_known_prefix(tmp_path: Path) -> None: + p = _write(tmp_path, "test_mcp_client_foo.py", "def test_x(): pass\n") + r = auto_classify(p) + assert "mcp" in r.subsystems + +def test_speed_inference_from_durations_fast(tmp_path: Path) -> None: + p = _write(tmp_path, "test_x.py", "def test_x(): pass\n") + durations = {f"{p.name}::test_x": 0.05} + r = auto_classify(p, durations=durations) + assert r.speed == Speed.FAST + +def test_speed_default_medium_without_durations(tmp_path: Path) -> None: + p = _write(tmp_path, "test_x.py", "def test_x(): pass\n") + r = auto_classify(p) + assert r.speed == Speed.MEDIUM + +def test_batch_group_inference_gui_subsystem(tmp_path: Path) -> None: + p = _write(tmp_path, "test_gui_layout_foo.py", "def test_x(): pass\n") + r = auto_classify(p) + assert r.batch_group == "gui" +``` + +- [ ] **Step 2: Run, confirm 4 new tests fail (and previous 6 still pass)** + +Run: `uv run pytest tests/test_categorizer.py -v` +Expected: 4 new tests FAIL with `AssertionError` on `subsystems`/`speed`/`batch_group` (which are currently empty/default). + +- [ ] **Step 3: Commit (red)** + +```bash +git add tests/test_categorizer.py +git commit -m "test(categorizer): add red tests for subsystem/speed/batch_group inference" +``` + +--- + +## Task 1.5: Implement subsystem, speed, and batch_group inference + +**Files:** +- Modify: `scripts/test_categorizer.py` + +- [ ] **Step 1: Add the constants and inference functions, wire them into auto_classify** + +```python +_SUBSYSTEM_PREFIXES = ( + "ai", "api", "arch", "ast", "async", "auto", "beads", "bias", "cache", + "cli", "cmd", "comms", "conductor", "context", "cost", "dag", "deepseek", + "diff", "discussion", "event", "execution", "external", "ext", "fuzzy", + "gemini", "gui", "headless", "history", "hooks", "hot", "imgui", "layout", + "live", "log", "mcp", "markdown", "minimax", "mma", "model", "orchestrator", + "outline", "parallel", "patch", "perf", "persona", "phase", "pipeline", + "preset", "prior", "process", "project", "provider", "rag", "script", + "session", "shader", "sim", "skeleton", "slice", "spawn", "status", + "subagent", "summary", "symbol", "sync", "synthesis", "system", "takes", + "theme", "thinking", "ticket", "tier4", "tiered", "token", "tool", "track", + "tree", "ts", "undo", "usage", "user", "vendor", "view", "visual", + "vlogger", "websocket", "workflow", "workspace", "z", +) + +_BATCH_GROUP_CLUSTERS: dict[str, tuple[str, ...]] = { + "core": ( + "mcp", "ai", "context", "api", "dag", "path", "presets", "personas", + "history", "workspace", "rag", "beads", "model", "ast", "async", "cache", + "cli", "cmd", "fuzzy", "hooks", "log", "markdown", "orchestrator", + "outline", "pipeline", "project", "provider", "script", "session", + "skeleton", "slice", "spawn", "status", "subagent", "summary", "symbol", + "sync", "synthesis", "system", "takes", "thinking", "tier4", "tiered", + "tool", "track", "tree", "ts", "usage", "vendor", "vlogger", "websocket", + "workflow", + ), + "gui": ("gui", "theme", "imgui", "layout", "live", "prior", "visual", "view", "undo"), + "mma": ("mma", "conductor", "execution", "ext", "external", "auto", "manual", "tier", "arch", "phase", "process", "z"), + "comms": ("comms", "diff", "patch", "event", "hot", "process", "shader"), + "headless": ("headless",), +} + +def _infer_subsystems(filename: str) -> list[str]: + stem = filename.removeprefix("test_").removesuffix(".py") + for prefix in sorted(_SUBSYSTEM_PREFIXES, key=len, reverse=True): + if stem.startswith(prefix + "_") or stem == prefix: + return [prefix] + return [] + +def _infer_batch_group(subsystems: list[str]) -> str: + if not subsystems: + return "core" + first = subsystems[0] + for group, members in _BATCH_GROUP_CLUSTERS.items(): + if first in members: + return group + return "core" + +def _infer_speed(filename: str, durations: dict[str, float] | None) -> Speed: + if not durations: + return Speed.MEDIUM + matching = [v for k, v in durations.items() if k.startswith(filename + "::")] + if not matching: + return Speed.MEDIUM + p95 = sorted(matching)[int(len(matching) * 0.95)] + if p95 < 1.0: + return Speed.FAST + if p95 < 5.0: + return Speed.MEDIUM + if p95 < 30.0: + return Speed.SLOW + return Speed.VERY_SLOW + +def auto_classify(path: Path, durations: dict[str, float] | None = None) -> CategoryRecord: + source = path.read_text(encoding="utf-8", errors="replace") + fixture_class = _classify_fixture_class(path, source) + subsystems = _infer_subsystems(path.name) + speed = _infer_speed(path.name, durations) + batch_group = _infer_batch_group(subsystems) + return CategoryRecord( + filename=path.name, + fixture_class=fixture_class, + subsystems=subsystems, + speed=speed, + batch_group=batch_group, + source="auto", + ) +``` + +- [ ] **Step 2: Run all 10 tests, confirm they pass** + +Run: `uv run pytest tests/test_categorizer.py -v` +Expected: All 10 tests PASS. + +- [ ] **Step 3: Commit (green)** + +```bash +git add scripts/test_categorizer.py +git commit -m "feat(categorizer): implement subsystem/speed/batch_group inference" +``` + +--- + +## Task 1.6: Write red tests for merge_registry and categorize_all + +**Files:** +- Modify: `tests/test_categorizer.py` + +- [ ] **Step 1: Append 3 failing tests for registry merge and full classification** + +```python +import tomllib +from scripts.test_categorizer import load_registry, merge_registry, categorize_all + +def test_load_registry_returns_dict(tmp_path: Path) -> None: + toml = tmp_path / "reg.toml" + toml.write_text('[files.test_x]\nfixture_class = "mock_app"\n', encoding="utf-8") + reg = load_registry(toml) + assert "test_x" in reg + assert reg["test_x"]["fixture_class"] == "mock_app" + +def test_merge_registry_overrides_auto(tmp_path: Path) -> None: + p = _write(tmp_path, "test_x.py", "def test_x(): pass\n") + auto = auto_classify(p) + assert auto.fixture_class == FixtureClass.UNIT + reg_entry = {"fixture_class": "mock_app", "subsystems": ["x"], "speed": "fast", "batch_group": "x"} + merged = merge_registry(auto, reg_entry) + assert merged.fixture_class == FixtureClass.MOCK_APP + assert merged.source == "registry" + assert "subsystems-override" in " ".join(merged.warnings) or merged.subsystems == ["x"] + +def test_categorize_all_handles_real_tests_dir(tmp_path: Path) -> None: + (tmp_path / "test_a.py").write_text("def test_x(): pass\n", encoding="utf-8") + (tmp_path / "test_b_sim.py").write_text("def test_x(live_gui): pass\n", encoding="utf-8") + reg_path = tmp_path / "reg.toml" + reg_path.write_text("", encoding="utf-8") + records = categorize_all(tmp_path, reg_path) + assert len(records) == 2 + by_name = {r.filename: r for r in records} + assert by_name["test_a.py"].fixture_class == FixtureClass.UNIT + assert by_name["test_b_sim.py"].fixture_class == FixtureClass.LIVE_GUI +``` + +- [ ] **Step 2: Run, confirm 3 new tests fail** + +Run: `uv run pytest tests/test_categorizer.py -v` +Expected: 3 new tests FAIL with `ImportError` (or `AttributeError` on `load_registry`/`merge_registry`/`categorize_all`). + +- [ ] **Step 3: Commit (red)** + +```bash +git add tests/test_categorizer.py +git commit -m "test(categorizer): add red tests for registry merge and full classification" +``` + +--- + +## Task 1.7: Implement load_registry, merge_registry, categorize_all + +**Files:** +- Modify: `scripts/test_categorizer.py` + +- [ ] **Step 1: Add the three new functions** + +```python +def load_registry(toml_path: Path) -> dict[str, dict]: + if not toml_path.exists(): + return {} + with toml_path.open("rb") as f: + data = tomllib.load(f) + return data.get("files", {}) + +def merge_registry(auto: CategoryRecord, entry: dict) -> CategoryRecord: + warnings = list(auto.warnings) + if "fixture_class" in entry and entry["fixture_class"] != auto.fixture_class.value: + warnings.append(f"fixture_class-override: {auto.fixture_class.value} -> {entry['fixture_class']}") + if "subsystems" in entry and set(entry["subsystems"]) != set(auto.subsystems): + warnings.append(f"subsystems-override: {auto.subsystems} -> {entry['subsystems']}") + return CategoryRecord( + filename=auto.filename, + fixture_class=FixtureClass(entry.get("fixture_class", auto.fixture_class.value)), + subsystems=list(entry.get("subsystems", auto.subsystems)), + speed=Speed(entry.get("speed", auto.speed.value)), + batch_group=entry.get("batch_group", auto.batch_group), + notes=entry.get("notes", auto.notes), + test_order=dict(auto.test_order), + source="registry", + warnings=warnings, + ) + +def categorize_all(tests_dir: Path, registry_path: Path) -> list[CategoryRecord]: + registry = load_registry(registry_path) + records: list[CategoryRecord] = [] + for path in sorted(tests_dir.glob("test_*.py")): + auto = auto_classify(path) + entry = registry.get(path.name, {}) + if entry: + records.append(merge_registry(auto, entry)) + else: + records.append(auto) + return records +``` + +- [ ] **Step 2: Run all 13 tests, confirm they pass** + +Run: `uv run pytest tests/test_categorizer.py -v` +Expected: All 13 tests PASS. + +- [ ] **Step 3: Commit (green)** + +```bash +git add scripts/test_categorizer.py +git commit -m "feat(categorizer): implement load_registry, merge_registry, categorize_all" +``` + +--- + +## Task 1.8: Smoke-test categorize_all on the real tests/ directory + +**Files:** +- Create: `tests/test_categorizer_smoke.py` (ephemeral, deleted in this task) + +- [ ] **Step 1: Run the categorizer against the real tests/ dir and inspect output** + +Run: +```bash +uv run python -c " +from pathlib import Path +from scripts.test_categorizer import categorize_all, FixtureClass +records = categorize_all(Path('tests'), Path('tests/test_categories.toml')) +print(f'Total: {len(records)}') +from collections import Counter +fc_counts = Counter(r.fixture_class for r in records) +for fc, n in sorted(fc_counts.items(), key=lambda x: x[0].value): + print(f' {fc.value}: {n}') +print('Subsystem distribution (top 10):') +sub_counts = Counter() +for r in records: + for s in r.subsystems: + sub_counts[s] += 1 +for s, n in sub_counts.most_common(10): + print(f' {s}: {n}') +" +``` + +Expected: prints `Total: 277` (or current count) and a tier breakdown. Confirm: +- `opt_in: 2` (test_clean_install + test_docker_build) +- `live_gui: 14` (all `*_sim.py` files) +- `unit: ~200+` (the rest) +- A few `mock_app` (files that reference mock_app/app_instance) + +- [ ] **Step 2: Sanity-check a few specific files** + +Run: +```bash +uv run python -c " +from pathlib import Path +from scripts.test_categorizer import categorize_all +records = {r.filename: r for r in categorize_all(Path('tests'), Path('tests/test_categories.toml'))} +for f in ['test_gui_dag_beads.py', 'test_arch_boundary_phase1.py', 'test_mcp_client.py', 'test_command_palette.py']: + r = records[f] + print(f'{f}: fc={r.fixture_class.value}, subs={r.subsystems}, bg={r.batch_group}, speed={r.speed.value}') +" +``` + +Expected: `test_mcp_client.py` → `fc=unit, subs=['mcp'], bg=core`. Other files have sensible values. + +- [ ] **Step 3: Delete the smoke-test scratch file (if any was created) and commit nothing** + +No commit. The smoke test was a one-shot `python -c` invocation. + +--- + +## Task 1.9: Write red tests for scripts/test_batcher.py::plan + +**Files:** +- Create: `tests/test_batcher.py` + +- [ ] **Step 1: Create test file with 5 failing tests for the plan function** + +```python +from pathlib import Path +import pytest +from scripts.test_categorizer import CategoryRecord, FixtureClass, Speed +from scripts.test_batcher import Batch, plan + +def _rec(name: str, fc: FixtureClass, bg: str = "core") -> CategoryRecord: + return CategoryRecord( + filename=name, + fixture_class=fc, + subsystems=[bg] if bg else [], + speed=Speed.MEDIUM, + batch_group=bg, + ) + +def test_plan_groups_unit_by_batch_group() -> None: + records = [ + _rec("test_a.py", FixtureClass.UNIT, "core"), + _rec("test_b.py", FixtureClass.UNIT, "gui"), + _rec("test_c.py", FixtureClass.UNIT, "core"), + ] + batches = plan(records) + unit_batches = [b for b in batches if b.tier == "1"] + labels = {b.label for b in unit_batches} + assert "tier-1-unit-core" in labels + assert "tier-1-unit-gui" in labels + +def test_plan_live_gui_tier_is_one_batch() -> None: + records = [ + _rec("test_a_sim.py", FixtureClass.LIVE_GUI, "gui"), + _rec("test_b_sim.py", FixtureClass.LIVE_GUI, "core"), + _rec("test_c_sim.py", FixtureClass.LIVE_GUI, "mma"), + ] + batches = plan(records) + live_batches = [b for b in batches if b.tier == "3"] + assert len(live_batches) == 1 + assert len(live_batches[0].files) == 3 + +def test_plan_opt_in_skipped_without_flag() -> None: + records = [ + _rec("test_clean_install.py", FixtureClass.OPT_IN), + _rec("test_docker_build.py", FixtureClass.OPT_IN), + ] + batches = plan(records, include_opt_in=False) + opt_batches = [b for b in batches if b.tier == "0"] + assert all(b.skip_reason for b in opt_batches) + +def test_plan_is_deterministic() -> None: + records = [ + _rec("test_a.py", FixtureClass.UNIT, "core"), + _rec("test_b.py", FixtureClass.UNIT, "gui"), + ] + a = plan(records) + b = plan(records) + assert [(x.tier, x.label, len(x.files)) for x in a] == [(y.tier, y.label, len(y.files)) for y in b] + +def test_plan_xdist_only_for_tier_1() -> None: + records = [_rec("test_a.py", FixtureClass.UNIT, "core")] + batches = plan(records, xdist=True) + unit = next(b for b in batches if b.tier == "1") + assert "-n" in unit.pytest_args + assert "auto" in unit.pytest_args + mock_records = [_rec("test_a.py", FixtureClass.MOCK_APP, "core")] + batches2 = plan(mock_records, xdist=True) + mock = next(b for b in batches2 if b.tier == "2") + assert "-n" not in mock.pytest_args +``` + +- [ ] **Step 2: Run, confirm 5 tests fail** + +Run: `uv run pytest tests/test_batcher.py -v` +Expected: All 5 tests FAIL with `ImportError: cannot import name 'Batch' from 'scripts.test_batcher'`. + +- [ ] **Step 3: Commit (red)** + +```bash +git add tests/test_batcher.py +git commit -m "test(batcher): add red tests for plan() function" +``` + +--- + +## Task 1.10: Implement scripts/test_batcher.py + +**Files:** +- Create: `scripts/test_batcher.py` + +- [ ] **Step 1: Create the file with Batch dataclass and plan function** + +```python +from dataclasses import dataclass +from pathlib import Path +from scripts.test_categorizer import CategoryRecord, FixtureClass + +@dataclass(frozen=True) +class Batch: + tier: str + label: str + files: list[Path] + pytest_args: list[str] + estimated_seconds: float + skip_reason: str | None = None + +_TIER_ORDER = ("0", "1", "2", "3", "H", "P") + +def _batches_for_unit(records: list[CategoryRecord], xdist: bool) -> list[Batch]: + by_group: dict[str, list[CategoryRecord]] = {} + for r in records: + by_group.setdefault(r.batch_group or "core", []).append(r) + batches: list[Batch] = [] + for group in sorted(by_group): + files = [Path("tests") / r.filename for r in by_group[group]] + args: list[str] = ["--maxfail=10"] + if xdist: + args = ["-n", "auto"] + args + batches.append(Batch( + tier="1", + label=f"tier-1-unit-{group}", + files=files, + pytest_args=args, + estimated_seconds=sum(_est(r) for r in by_group[group]), + )) + return batches + +def _batches_for_mock_app(records: list[CategoryRecord]) -> list[Batch]: + by_group: dict[str, list[CategoryRecord]] = {} + for r in records: + by_group.setdefault(r.batch_group or "core", []).append(r) + batches: list[Batch] = [] + for group in sorted(by_group): + files = [Path("tests") / r.filename for r in by_group[group]] + batches.append(Batch( + tier="2", + label=f"tier-2-mock_app-{group}", + files=files, + pytest_args=["--maxfail=5"], + estimated_seconds=sum(_est(r) for r in by_group[group]), + )) + return batches + +def _batches_for_live_gui(records: list[CategoryRecord]) -> list[Batch]: + if not records: + return [] + files = [Path("tests") / r.filename for r in records] + return [Batch( + tier="3", + label="tier-3-live_gui", + files=files, + pytest_args=["--maxfail=1"], + estimated_seconds=sum(_est(r) for r in records), + )] + +def _batches_for_headless(records: list[CategoryRecord]) -> list[Batch]: + if not records: + return [] + files = [Path("tests") / r.filename for r in records] + return [Batch( + tier="H", + label="tier-H-headless", + files=files, + pytest_args=["--maxfail=5"], + estimated_seconds=sum(_est(r) for r in records), + )] + +def _batches_for_performance(records: list[CategoryRecord]) -> list[Batch]: + if not records: + return [] + files = [Path("tests") / r.filename for r in records] + return [Batch( + tier="P", + label="tier-P-performance", + files=files, + pytest_args=["--maxfail=1"], + estimated_seconds=sum(_est(r) for r in records), + )] + +def _batches_for_opt_in(records: list[CategoryRecord], include_opt_in: bool) -> list[Batch]: + batches: list[Batch] = [] + for r in records: + files = [Path("tests") / r.filename] + skip_reason: str | None = None + if not include_opt_in: + skip_reason = "--include-opt-in not set" + elif r.filename.startswith("test_clean_install") and not _env_set("RUN_CLEAN_INSTALL_TEST"): + skip_reason = "RUN_CLEAN_INSTALL_TEST not set" + elif r.filename.startswith("test_docker_build") and not _env_set("RUN_DOCKER_TEST"): + skip_reason = "RUN_DOCKER_TEST not set" + batches.append(Batch( + tier="0", + label=f"tier-0-opt_in-{r.filename.removeprefix('test_').removesuffix('.py')}", + files=files, + pytest_args=["--maxfail=1"], + estimated_seconds=_est(r), + skip_reason=skip_reason, + )) + return batches + +import os +def _env_set(name: str) -> bool: + return bool(os.environ.get(name)) + +_SPEED_SECONDS = {"fast": 0.5, "medium": 3.0, "slow": 15.0, "very_slow": 60.0} +def _est(r: CategoryRecord) -> float: + return _SPEED_SECONDS.get(r.speed.value, 3.0) + +def plan( + records: list[CategoryRecord], + *, + tiers: set[str] = set(_TIER_ORDER), + include_opt_in: bool = False, + xdist: bool = True, +) -> list[Batch]: + by_fc: dict[FixtureClass, list[CategoryRecord]] = {fc: [] for fc in FixtureClass} + for r in records: + by_fc[r.fixture_class].append(r) + out: list[Batch] = [] + if "0" in tiers: + out.extend(_batches_for_opt_in(by_fc[FixtureClass.OPT_IN], include_opt_in)) + if "1" in tiers: + out.extend(_batches_for_unit(by_fc[FixtureClass.UNIT], xdist)) + if "2" in tiers: + out.extend(_batches_for_mock_app(by_fc[FixtureClass.MOCK_APP])) + if "3" in tiers: + out.extend(_batches_for_live_gui(by_fc[FixtureClass.LIVE_GUI])) + if "H" in tiers: + out.extend(_batches_for_headless(by_fc[FixtureClass.HEADLESS])) + if "P" in tiers: + out.extend(_batches_for_performance(by_fc[FixtureClass.PERFORMANCE])) + out.sort(key=lambda b: (_TIER_ORDER.index(b.tier), b.label)) + return out +``` + +- [ ] **Step 2: Run all 5 tests, confirm they pass** + +Run: `uv run pytest tests/test_batcher.py -v` +Expected: All 5 tests PASS. + +- [ ] **Step 3: Commit (green)** + +```bash +git add scripts/test_batcher.py +git commit -m "feat(batcher): implement Batch dataclass and plan() function" +``` + +--- + +## Task 1.11: Write red tests for scripts/pytest_collection_order.py + +**Files:** +- Create: `tests/test_pytest_collection_order.py` + +- [ ] **Step 1: Create test file with 2 failing tests for the plugin** + +```python +import textwrap +from pathlib import Path +import pytest + +def test_no_op_without_registry(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None: + monkeypatch.chdir(tmp_path) + (tmp_path / "test_zz.py").write_text( + "def test_b(): pass\ndef test_a(): pass\n", encoding="utf-8" + ) + from scripts.pytest_collection_order import sort_items_by_order + items = [] + result = sort_items_by_order(items, registry_path=tmp_path / "reg.toml") + assert result == items + +def test_sorts_by_order_index(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None: + monkeypatch.chdir(tmp_path) + (tmp_path / "test_zz.py").write_text( + "def test_b(): pass\ndef test_a(): pass\n", encoding="utf-8" + ) + reg = tmp_path / "reg.toml" + reg.write_text( + "[[files.test_zz.test_order]]\n" + "test_id = 'test_zz::test_b'\n" + "order = 1\n" + "[[files.test_zz.test_order]]\n" + "test_id = 'test_zz::test_a'\n" + "order = 2\n", + encoding="utf-8", + ) + class _Item: + def __init__(self, nodeid: str) -> None: + self.nodeid = nodeid + def __repr__(self) -> str: + return f"_Item({self.nodeid!r})" + items = [_Item("test_zz::test_b"), _Item("test_zz::test_a")] + from scripts.pytest_collection_order import sort_items_by_order + result = sort_items_by_order(items, registry_path=reg) + assert [i.nodeid for i in result] == ["test_zz::test_b", "test_zz::test_a"] +``` + +- [ ] **Step 2: Run, confirm 2 tests fail** + +Run: `uv run pytest tests/test_pytest_collection_order.py -v` +Expected: 2 tests FAIL with `ImportError: cannot import name 'sort_items_by_order'`. + +- [ ] **Step 3: Commit (red)** + +```bash +git add tests/test_pytest_collection_order.py +git commit -m "test(collection_order): add red tests for opt-in sort_items_by_order" +``` + +--- + +## Task 1.12: Implement scripts/pytest_collection_order.py + +**Files:** +- Create: `scripts/pytest_collection_order.py` + +- [ ] **Step 1: Create the file with the sort function and pytest hook** + +```python +from pathlib import Path +import tomllib + +def _load_order_map(registry_path: Path) -> dict[str, dict[str, int]]: + if not registry_path.exists(): + return {} + with registry_path.open("rb") as f: + data = tomllib.load(f) + files = data.get("files", {}) + out: dict[str, dict[str, int]] = {} + for fname, entry in files.items(): + order_list = entry.get("test_order", []) + if isinstance(order_list, list): + out[fname] = {item["test_id"]: int(item["order"]) for item in order_list} + elif isinstance(order_list, dict): + out[fname] = {k: int(v) for k, v in order_list.items()} + return out + +def sort_items_by_order(items: list, registry_path: Path) -> list: + order_map = _load_order_map(registry_path) + if not order_map: + return list(items) + by_file: dict[str, list] = {} + for it in items: + nodeid = getattr(it, "nodeid", "") + fname = nodeid.split("::", 1)[0] if "::" in nodeid else "" + by_file.setdefault(fname, []).append(it) + out: list = [] + for fname, group in by_file.items(): + fmap = order_map.get(fname) + if not fmap: + out.extend(group) + continue + def _key(it) -> tuple[int, int]: + nid = getattr(it, "nodeid", "") + idx = fmap.get(nid, 1 << 30) + return (idx, group.index(it)) + out.extend(sorted(group, key=_key)) + return out + +def pytest_collection_modifyitems(config, items) -> None: + tests_dir = Path(getattr(config, "rootdir", Path.cwd())) + registry_path = tests_dir / "test_categories.toml" + new_items = sort_items_by_order(list(items), registry_path=registry_path) + items[:] = new_items +``` + +- [ ] **Step 2: Run, confirm 2 tests pass** + +Run: `uv run pytest tests/test_pytest_collection_order.py -v` +Expected: 2 tests PASS. + +- [ ] **Step 3: Commit (green)** + +```bash +git add scripts/pytest_collection_order.py +git commit -m "feat(collection_order): implement opt-in per-test sort via conftest hook" +``` + +--- + +## Task 1.13: Wire the plugin in tests/conftest.py + +**Files:** +- Modify: `tests/conftest.py:1-30` (read first; this file is 250+ lines) + +- [ ] **Step 1: Read the first 30 lines of conftest.py to understand structure** + +Run: `Read tests/conftest.py (first 30 lines)` via MCP `manual-slop_get_file_slice path=tests/conftest.py start_line=1 end_line=30`. Identify where module-level imports and pytest_plugins are declared. + +- [ ] **Step 2: Add `pytest_plugins` line (or extend if it exists)** + +If `tests/conftest.py` does NOT already have a `pytest_plugins` line, ADD this line near the top (after imports, before any fixtures): + +```python +pytest_plugins = ["scripts.pytest_collection_order"] +``` + +If `pytest_plugins` already exists, APPEND `"scripts.pytest_collection_order"` to the list. + +Use the surgical edit tool (`manual-slop_edit_file` with the appropriate `old_string`/`new_string`) to make this change with **1-space indentation** and no comments. + +- [ ] **Step 3: Run the full test suite to confirm no regressions** + +Run: `uv run pytest tests/ --ignore=tests/test_categorizer_smoke.py -x -q --timeout=60 2>&1 | tail -30` +Expected: most tests pass; the new categorizer + batcher + plugin tests pass; pre-existing failures (if any) match the known baseline from `conductor/tracks/regression_fixes_20260605/`. + +- [ ] **Step 4: Commit** + +```bash +git add tests/conftest.py +git commit -m "test(conftest): register scripts.pytest_collection_order as pytest plugin" +``` + +--- + +## Task 1.14: Implement scripts/run_tests_batched.py with --plan and --audit modes only + +**Files:** +- Modify: `scripts/run_tests_batched.py` (replace contents) + +- [ ] **Step 1: Replace the file with the new orchestrator (Phase 1 stub: no actual pytest execution)** + +```python +import argparse +import sys +from pathlib import Path + +from scripts.test_categorizer import categorize_all +from scripts.test_batcher import plan + +def _print_plan(records, options) -> int: + batches = plan(records, include_opt_in=options.include_opt_in, xdist=not options.no_xdist) + for b in batches: + status = "SKIP" if b.skip_reason else "RUN" + print(f"[{status}] {b.label}: {len(b.files)} files, est {b.estimated_seconds:.1f}s, args={b.pytest_args}") + if b.skip_reason: + print(f" reason: {b.skip_reason}") + return 0 + +def _print_audit(records, strict: bool) -> int: + auto = [r for r in records if r.source == "auto"] + print(f"Auto-inferred (unclassified) records: {len(auto)}") + for r in auto: + print(f" {r.filename}: fc={r.fixture_class.value}, subs={r.subsystems}, bg={r.batch_group}") + if strict: + bad = [r for r in auto if len(r.subsystems) > 1] + if bad: + print(f"STRICT: {len(bad)} auto-inferred files have multiple subsystems (probably cross-cutting):") + for r in bad: + print(f" {r.filename}: subs={r.subsystems}") + return 1 + return 0 + +def main() -> int: + p = argparse.ArgumentParser() + p.add_argument("--tests-dir", default="tests") + p.add_argument("--registry", default="tests/test_categories.toml") + p.add_argument("--tiers", default="1,2,3,H") + p.add_argument("--include-opt-in", action="store_true") + p.add_argument("--no-xdist", action="store_true") + p.add_argument("--plan", action="store_true") + p.add_argument("--audit", action="store_true") + p.add_argument("--strict", action="store_true") + options = p.parse_args() + records = categorize_all(Path(options.tests_dir), Path(options.registry)) + if options.audit: + return _print_audit(records, strict=options.strict) + if options.plan: + return _print_plan(records, options) + print("Phase 1 stub: no actual test execution yet. Use --plan or --audit.") + return 0 + +if __name__ == "__main__": + sys.exit(main()) +``` + +- [ ] **Step 2: Verify --plan output** + +Run: `python scripts/run_tests_batched.py --plan 2>&1 | head -30` +Expected: prints a list of batches with labels like `[RUN] tier-1-unit-core: 42 files, est 126.0s, args=['-n', 'auto', '--maxfail=10']` etc. + +- [ ] **Step 3: Verify --audit output** + +Run: `python scripts/run_tests_batched.py --audit 2>&1 | head -20` +Expected: prints `Auto-inferred (unclassified) records: 275` (or similar; all files are auto-inferred because the registry is empty) followed by per-file lines. + +- [ ] **Step 4: Verify --audit --strict exits non-zero (when cross-cutting auto-classification occurs)** + +Run: `python scripts/run_tests_batched.py --audit --strict 2>&1 | tail -5; echo "exit: $?"` +Expected: prints STRICT violations (if any auto-inferred file has multiple subsystems — likely 0 in Phase 1) and exits 0 OR 1 depending on whether any cross-cutting auto-inferred files exist. + +- [ ] **Step 5: Commit** + +```bash +git add scripts/run_tests_batched.py +git commit -m "feat(run_tests_batched): add --plan and --audit modes (Phase 1 stub)" +``` + +--- + +## Task 1.15: Manually verify --plan output matches the spec + +**Files:** none (manual verification only) + +- [ ] **Step 1: Verify the spec invariants from Section 3.3 of spec.md** + +Run: +```bash +python scripts/run_tests_batched.py --plan 2>&1 | grep -E "tier-(0|1|2|3|H|P)" +``` + +Expected: +- Exactly ONE `tier-3-live_gui` batch (containing all 14 `*_sim.py` files) +- `tier-0-opt_in-clean_install` AND `tier-0-opt_in-docker` both SKIP (no env var) +- `tier-1-unit-*` batches grouped by subsystem batch_group (core, gui, mma, comms, headless) +- `tier-2-mock_app-*` batches (if any tests use mock_app) +- `tier-H-headless` and `tier-P-performance` (if any tests match) + +- [ ] **Step 2: Confirm no live_gui test appears in a non-tier-3 batch** + +Run: +```bash +python scripts/run_tests_batched.py --plan 2>&1 | grep -E "_sim\.py" | grep -v "tier-3" +``` + +Expected: NO output (zero matches). Every `*_sim.py` file is in the tier-3 batch only. + +- [ ] **Step 3: Document the verification in a short note for git history** + +If any invariant is violated, STOP and debug. Otherwise, proceed to commit. No commit is needed in this task; the verification is captured in the Phase 1 checkpoint git note. + +--- + +## Task 1.16: Phase 1 checkpoint commit and git note + +**Files:** none (commit + note only) + +- [ ] **Step 1: Confirm all Phase 1 tests pass** + +Run: `uv run pytest tests/test_categorizer.py tests/test_batcher.py tests/test_pytest_collection_order.py -v` +Expected: All 20 tests (13 + 5 + 2) PASS. + +- [ ] **Step 2: Confirm the full test suite still works (no regressions from plugin wiring)** + +Run: `uv run pytest tests/ --ignore=tests/test_audit_main_thread_imports.py -q --timeout=60 2>&1 | tail -15` +Expected: existing test results match the pre-Phase-1 baseline (any pre-existing failures are unchanged; no NEW failures introduced). + +- [ ] **Step 3: Create the checkpoint commit (if there are uncommitted changes) and attach git note** + +```bash +git add -A +if ! git diff --cached --quiet; then git commit -m "conductor(checkpoint): Phase 1 complete - library + dry-run modes"; fi +SHA=$(git log -1 --format="%H") +git notes add -m "Phase 1 checkpoint: test_batching_refactor_20260606 + +Library + dry-run complete: +- scripts/test_categorizer.py: 13 tests, all auto-inference rules +- scripts/test_batcher.py: 5 tests, deterministic plan() with 6 tiers +- scripts/pytest_collection_order.py: 2 tests, opt-in per-test sort +- scripts/run_tests_batched.py: --plan and --audit modes (no execution) +- tests/conftest.py: plugin registered (no-op without entries) + +Manually verified: all 14 *_sim.py files are in ONE tier-3 batch. +Opt-in tests SKIP cleanly without env var. + +Next: Phase 2 (shadow run via CI)." "$SHA" +``` + +- [ ] **Step 4: Update state.toml phase_1 status to checkpoint** + +Edit `conductor/tracks/test_batching_refactor_20260606/state.toml` line: +```toml +phase_1 = { status = "in_progress", checkpoint_sha = "", name = "Library + dry-run modes" } +``` +Change to: +```toml +phase_1 = { status = "completed", checkpoint_sha = "", name = "Library + dry-run modes" } +``` + +Then: +```bash +git add conductor/tracks/test_batching_refactor_20260606/state.toml +git commit -m "conductor(plan): mark Phase 1 complete in test_batching_refactor_20260606" +``` + +--- + +# Phase 2: Shadow run + +> Goal: Run the new script in CI as a non-blocking informational job. Compare its pass/fail signature to the old script's. Investigate any divergence. + +--- + +## Task 2.1: Add a CI workflow job for the shadow run + +**Files:** +- Create: `.github/workflows/test_batching_shadow.yml` (or equivalent CI config file in this repo's CI location) + +- [ ] **Step 1: Identify the CI configuration location** + +Check the repo for existing CI files. Look for `.github/workflows/`, `.gitlab-ci.yml`, or similar. If a CI directory exists, follow its conventions. + +- [ ] **Step 2: Create a non-blocking job that runs `python scripts/run_tests_batched.py --plan` and uploads the output as an artifact** + +```yaml +name: test-batching-shadow +on: [push, pull_request] +jobs: + shadow: + runs-on: ubuntu-latest + continue-on-error: true + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-python@v5 + with: + python-version: "3.11" + - run: pip install uv + - run: uv sync + - run: python scripts/run_tests_batched.py --plan > plan.txt 2>&1 + - run: python scripts/run_tests_batched.py --audit > audit.txt 2>&1 + - uses: actions/upload-artifact@v4 + with: + name: test-batching-plan + path: | + plan.txt + audit.txt +``` + +Adjust the runner / step names to match the repo's conventions. + +- [ ] **Step 3: Commit and push to a feature branch** + +```bash +git checkout -b test_batching_shadow_ci +git add .github/workflows/test_batching_shadow.yml +git commit -m "ci: add test batching shadow run (informational, non-blocking)" +git push -u origin test_batching_shadow_ci +``` + +- [ ] **Step 4: Open a PR and observe 1+ CI runs** + +(Manual) Verify the shadow job runs and uploads artifacts. Compare `plan.txt` and `audit.txt` to manual expectations. + +--- + +## Task 2.2: Investigate and fix any categorizer/batcher divergence + +**Files:** as needed (likely `scripts/test_categorizer.py` or `scripts/test_batcher.py`) + +- [ ] **Step 1: Compare the shadow job's `plan.txt` against the manually-verified plan from Task 1.15** + +If they match, skip to Task 2.3. If they diverge, identify the source. + +- [ ] **Step 2: Add a regression test for the divergence** + +If a real bug was found, write a failing test that reproduces it BEFORE fixing. + +- [ ] **Step 3: Fix the bug** + +Follow TDD: red → green → commit. + +--- + +## Task 2.3: Phase 2 checkpoint + +**Files:** none (commit + note only) + +- [ ] **Step 1: Confirm the shadow job has been green for at least 1 week of CI runs** + +(Manual) If the shadow job has not been green for 1 week, wait. If 1+ week has passed without divergence, proceed. + +- [ ] **Step 2: Create the checkpoint commit and git note** + +```bash +git add -A +if ! git diff --cached --quiet; then git commit -m "conductor(checkpoint): Phase 2 complete - shadow run validated"; fi +SHA=$(git log -1 --format="%H") +git notes add -m "Phase 2 checkpoint: shadow run validated. No divergence between new and old scripts over 1+ week of CI runs. Ready to switch default." "$SHA" +``` + +- [ ] **Step 3: Update state.toml phase_2 status** + +Edit `state.toml` phase_2 line: change status to `"completed"`, fill in `checkpoint_sha`. + +```bash +git add conductor/tracks/test_batching_refactor_20260606/state.toml +git commit -m "conductor(plan): mark Phase 2 complete in test_batching_refactor_20260606" +``` + +--- + +# Phase 3: Switch default + +> Goal: Replace the old `run_tests_batched.py` with the new one (with full CLI including `--tiers`, `--include-opt-in`, `--durations` recording). Update `docs/guide_testing.md`. Keep old script as `.legacy` for one cycle. + +--- + +## Task 3.1: Add --tiers, --durations, and full execution logic to run_tests_batched.py + +**Files:** +- Modify: `scripts/run_tests_batched.py` + +- [ ] **Step 1: Extend the script to actually execute pytest per batch** + +Replace the entire contents of `scripts/run_tests_batched.py` with the full version that includes: +- Tier parsing from `--tiers` (e.g., `--tiers 1,2,3`) +- `--durations` flag to record `.test_durations.json` +- Actual `subprocess.run(uv run pytest ...)` per batch +- Summary table output (per Section 5 of spec.md) +- Worst exit code returned + +The full implementation (~150 lines) follows the spec's Section 4.3: + +```python +import argparse +import json +import os +import subprocess +import sys +import time +from pathlib import Path + +from scripts.test_categorizer import categorize_all, FixtureClass +from scripts.test_batcher import plan, Batch + +def _parse_tiers(s: str) -> set[str]: + return {t.strip() for t in s.split(",") if t.strip()} + +def _durations_path(tests_dir: Path) -> Path: + return tests_dir / ".test_durations.json" + +def _load_durations(p: Path) -> dict[str, float]: + if not p.exists(): + return {} + try: + with p.open("r", encoding="utf-8") as f: + return json.load(f) + except (json.JSONDecodeError, OSError): + return {} + +def _save_durations(p: Path, durations: dict[str, float]) -> None: + tmp = p.with_suffix(".json.tmp") + with tmp.open("w", encoding="utf-8") as f: + json.dump(durations, f, indent=2, sort_keys=True) + tmp.replace(p) + +def _parse_durations_from_pytest_output(stdout: str) -> dict[str, float]: + out: dict[str, float] = {} + for line in stdout.splitlines(): + line = line.strip() + if "::" not in line or " " not in line: + continue + parts = line.rsplit(None, 1) + if len(parts) != 2: + continue + nodeid, time_str = parts + try: + out[nodeid] = float(time_str.rstrip("s")) + except ValueError: + continue + return out + +def _run_batch(b: Batch, durations: dict[str, float]) -> tuple[int, float, dict[str, float]]: + if b.skip_reason: + return 0, 0.0, {} + cmd = ["uv", "run", "pytest", "-v", "--durations=0"] + b.pytest_args + [str(f) for f in b.files] + print(f"\n>>> Running {b.label} ({len(b.files)} files)") + t0 = time.monotonic() + proc = subprocess.run(cmd, capture_output=True, text=True) + elapsed = time.monotonic() - t0 + new_durs = _parse_durations_from_pytest_output(proc.stdout) + print(proc.stdout[-2000:] if proc.returncode != 0 else f"<<< {b.label} PASS in {elapsed:.1f}s") + if proc.returncode != 0: + print(f"<<< {b.label} FAIL (exit {proc.returncode}) in {elapsed:.1f}s") + print(proc.stderr[-1000:]) + return proc.returncode, elapsed, new_durs + +def _print_summary(results: list[tuple[Batch, int, float]]) -> int: + print("\n" + "=" * 60) + print("SUMMARY") + print("=" * 60) + worst = 0 + for b, code, elapsed in results: + if b.skip_reason: + status = "SKIPPED" + elif code == 0: + status = "PASS" + else: + status = "FAIL" + worst = max(worst, code) + n = len(b.files) + print(f"[{b.tier}] {b.label:40s} {status:8s} {n} files {elapsed:6.1f}s") + return worst + +def main() -> int: + p = argparse.ArgumentParser() + p.add_argument("--tests-dir", default="tests") + p.add_argument("--registry", default="tests/test_categories.toml") + p.add_argument("--tiers", default="1,2,3,H") + p.add_argument("--include-opt-in", action="store_true") + p.add_argument("--no-xdist", action="store_true") + p.add_argument("--plan", action="store_true") + p.add_argument("--audit", action="store_true") + p.add_argument("--strict", action="store_true") + p.add_argument("--durations", action="store_true", help="Record per-test durations to .test_durations.json") + options = p.parse_args() + tiers = _parse_tiers(options.tiers) + tests_dir = Path(options.tests_dir) + durations_path = _durations_path(tests_dir) + durations = _load_durations(durations_path) + records = categorize_all(tests_dir, Path(options.registry)) + if options.audit: + from scripts.run_tests_batched_helpers import print_audit + return print_audit(records, strict=options.strict) + batches = plan(records, tiers=tiers, include_opt_in=options.include_opt_in, xdist=not options.no_xdist) + if options.plan: + for b in batches: + status = "SKIP" if b.skip_reason else "RUN" + print(f"[{status}] {b.label}: {len(b.files)} files, est {b.estimated_seconds:.1f}s") + return 0 + results: list[tuple[Batch, int, float]] = [] + merged_durations = dict(durations) + for b in batches: + code, elapsed, new_durs = _run_batch(b, merged_durations) + results.append((b, code, elapsed)) + merged_durations.update(new_durs) + if options.durations: + _save_durations(durations_path, merged_durations) + return _print_summary(results) + +if __name__ == "__main__": + sys.exit(main()) +``` + +- [ ] **Step 2: Run --plan to confirm structure** + +Run: `python scripts/run_tests_batched.py --plan --tiers 1,2,3 2>&1 | head -10` +Expected: prints tier-1, tier-2, tier-3 batches (no execution). + +- [ ] **Step 3: Run --tiers 1 to test the actual execution path on a small batch** + +Run: `python scripts/run_tests_batched.py --tiers 1 2>&1 | tail -20` +Expected: runs all unit-tier batches and prints a SUMMARY table. Some tests may fail; that's OK (the script's exit code reflects test pass/fail, not the implementation). + +- [ ] **Step 4: Commit** + +```bash +git add scripts/run_tests_batched.py +git commit -m "feat(run_tests_batched): full CLI with --tiers, --durations, actual pytest execution" +``` + +--- + +## Task 3.2: Rename the old script to .legacy + +**Files:** +- Rename: `scripts/run_tests_batched.py` (the old, pre-Phase-1 version) → `scripts/run_tests_batched.py.legacy` + +- [ ] **Step 1: Recover the old script from git history** + +The old 36-line version was committed at SHA `b7a97374^` (or earlier; check `git log --all --oneline -- scripts/run_tests_batched.py | head -5`). Recover it: + +```bash +git log --oneline --all -- scripts/run_tests_batched.py | head -5 +git show :scripts/run_tests_batched.py > scripts/run_tests_batched.py.legacy +``` + +- [ ] **Step 2: Verify the .legacy file matches the original 36-line version** + +Run: `Get-Content scripts/run_tests_batched.py.legacy | Measure-Object -Line` (should be 36 lines). + +- [ ] **Step 3: Commit** + +```bash +git add scripts/run_tests_batched.py.legacy +git commit -m "chore: preserve old run_tests_batched.py as .legacy for one cycle" +``` + +--- + +## Task 3.3: Update docs/guide_testing.md + +**Files:** +- Modify: `docs/guide_testing.md` (Section: "Running Tests") + +- [ ] **Step 1: Read the current "Running Tests" section** + +Use `manual-slop_get_file_slice path=docs/guide_testing.md start_line=388 end_line=448` (or grep for "## Running Tests" to find the exact range). + +- [ ] **Step 2: Replace the "All Tests" / "Specific Test File" subsections with new content referencing the new script** + +Append a new subsection before "By Marker": + +```markdown +### Batched Run (Default for Local Development) + +The default for local development is the new categorized batcher: + +```bash +python scripts/run_tests_batched.py +``` + +This runs 6 fixture-class-isolated tiers: opt-in (skipped unless `--include-opt-in`), unit (with pytest-xdist), mock_app, live_gui (one session), headless, performance. Each tier prints a summary line. Use `--plan` to see the batch plan without running; `--audit` to list unclassified files; `--tiers 1,2` to limit which tiers run. + +See `conductor/tracks/test_batching_refactor_20260606/spec.md` for the full design. +``` + +- [ ] **Step 3: Verify the docs render correctly** + +Run: `grep -A 2 "Batched Run" docs/guide_testing.md | head -10` +Expected: shows the new section. + +- [ ] **Step 4: Commit** + +```bash +git add docs/guide_testing.md +git commit -m "docs(testing): document new run_tests_batched.py in Running Tests section" +``` + +--- + +## Task 3.4: Phase 3 checkpoint + +**Files:** none (commit + note only) + +- [ ] **Step 1: Run the full new script and confirm the existing 273+ test suite still passes (modulo any pre-existing failures)** + +Run: `python scripts/run_tests_batched.py --tiers 1,2 2>&1 | tail -15` +Expected: SUMMARY table with all tier-1 and tier-2 batches either PASS or have the same pre-existing failures as before. + +- [ ] **Step 2: Run live_gui tier separately** + +Run: `python scripts/run_tests_batched.py --tiers 3 2>&1 | tail -10` +Expected: tier-3-live_gui batch runs all `*_sim.py` files; some may fail (pre-existing); no NEW failures. + +- [ ] **Step 3: Create the checkpoint commit and git note** + +```bash +git add -A +if ! git diff --cached --quiet; then git commit -m "conductor(checkpoint): Phase 3 complete - new script is default"; fi +SHA=$(git log -1 --format="%H") +git notes add -m "Phase 3 checkpoint: new run_tests_batched.py is the default. Old script preserved as .legacy. docs/guide_testing.md updated. Ready for Phase 4 cleanup." "$SHA" +``` + +- [ ] **Step 4: Update state.toml phase_3 status** + +Edit `state.toml` phase_3 line: change status to `"completed"`, fill in `checkpoint_sha`. + +```bash +git add conductor/tracks/test_batching_refactor_20260606/state.toml +git commit -m "conductor(plan): mark Phase 3 complete in test_batching_refactor_20260606" +``` + +--- + +# Phase 4: Cleanup + +> Goal: Populate the registry with the ~30 cross-cutting / ambiguous files identified during the audit. Delete the legacy script. Add `.test_durations.json` to `.gitignore`. Archive the track. + +--- + +## Task 4.1: Run --audit on a clean clone and collect cross-cutting files + +**Files:** +- Create: `docs/test_categorization_audit_20260606.md` (intermediate; can be deleted or kept) + +- [ ] **Step 1: Run --audit --strict and capture output** + +Run: +```bash +python scripts/run_tests_batched.py --audit --strict > docs/test_categorization_audit_20260606.md 2>&1 +echo "exit: $?" >> docs/test_categorization_audit_20260606.md +``` + +- [ ] **Step 2: Review the audit output and identify ~30 cross-cutting / ambiguous files** + +Open the file and find: +- Auto-inferred files with multiple subsystems (cross-cutting) +- Auto-inferred files with empty subsystems (filename doesn't start with a known prefix; e.g., `test_z_negative_flows.py`, `test_subagent_summarization.py`) +- Files where the auto-inferred batch_group feels wrong + +- [ ] **Step 3: Commit the audit report** + +```bash +git add docs/test_categorization_audit_20260606.md +git commit -m "conductor(audit): capture Phase 4 cross-cutting file audit" +``` + +--- + +## Task 4.2: Populate tests/test_categories.toml with cross-cutting entries + +**Files:** +- Create: `tests/test_categories.toml` + +- [ ] **Step 1: Create the registry file with the ~30 cross-cutting entries** + +```toml +# Hand-curated registry for cross-cutting and ambiguous tests. +# Auto-inferred records that are correct do NOT need entries here. +# Generated 2026-06-06 from docs/test_categorization_audit_20260606.md. + +[files.test_gui_dag_beads] +fixture_class = "live_gui" +subsystems = ["gui", "dag", "beads"] +batch_group = "gui" +notes = "Cross-cutting: drives GUI, asserts on DAG state, exercises Beads backend" + +[files.test_arch_boundary_phase1] +subsystems = ["architecture"] +batch_group = "mma" +notes = "Phase 1 of arch-boundary refactor; subsystem ambiguous (arch/phase), group = mma" + +[files.test_arch_boundary_phase2] +subsystems = ["architecture"] +batch_group = "mma" + +[files.test_arch_boundary_phase3] +subsystems = ["architecture"] +batch_group = "mma" + +[files.test_z_negative_flows] +subsystems = ["misc"] +batch_group = "core" +notes = "Filename prefix 'z' is not a known subsystem; auto-inference fails" + +[files.test_subagent_summarization] +subsystems = ["mma", "tiered"] +batch_group = "mma" + +[files.test_tiered_aggregation] +subsystems = ["mma", "tiered"] +batch_group = "mma" + +[files.test_tiered_context] +subsystems = ["mma", "tiered"] +batch_group = "mma" + +[files.test_tier4_interceptor] +subsystems = ["tier4", "mma"] +batch_group = "mma" + +[files.test_tier4_patch_generation] +subsystems = ["tier4", "mma"] +batch_group = "mma" + +# Continue with ~20 more entries identified in the audit... +# (The engineer filling this in should add entries for all files flagged +# by --audit --strict in Task 4.1) +``` + +(The full ~30 entries are filled in by the implementer based on the audit output. The pattern above shows the schema.) + +- [ ] **Step 2: Re-run --audit; expect fewer auto-inferred records and zero strict violations** + +Run: `python scripts/run_tests_batched.py --audit --strict 2>&1 | tail -5; echo "exit: $?"` +Expected: exit 0; STRICT violations are zero (or significantly reduced). + +- [ ] **Step 3: Commit** + +```bash +git add tests/test_categories.toml +git commit -m "feat(tests): populate test_categories.toml with cross-cutting entries" +``` + +--- + +## Task 4.3: Add .test_durations.json to .gitignore + +**Files:** +- Modify: `.gitignore` + +- [ ] **Step 1: Read .gitignore and find a sensible place to add the entry** + +Run: `Get-Content .gitignore | Select-String -Pattern "tests" -SimpleMatch` to see existing test-related entries. + +- [ ] **Step 2: Append the new entry** + +Add (at the end, or grouped with other test artifact ignores): + +``` +# Local test duration cache (developer-local; regenerated on each batched run) +tests/.test_durations.json +``` + +- [ ] **Step 3: Verify the file is now ignored** + +Run: `git check-ignore -v tests/.test_durations.json` +Expected: prints `.gitignore: tests/.test_durations.json` confirming the ignore. + +- [ ] **Step 4: Commit** + +```bash +git add .gitignore +git commit -m "chore: gitignore tests/.test_durations.json (developer-local cache)" +``` + +--- + +## Task 4.4: Delete the legacy script + +**Files:** +- Delete: `scripts/run_tests_batched.py.legacy` + +- [ ] **Step 1: Confirm no remaining references to the legacy script** + +Run: `rg "run_tests_batched.py.legacy" .` (or grep recursively) +Expected: no matches (the legacy file should be unused). + +- [ ] **Step 2: Delete the file** + +Run: `Remove-Item scripts/run_tests_batched.py.legacy` + +- [ ] **Step 3: Commit** + +```bash +git add -A +git commit -m "chore: delete legacy run_tests_batched.py (was preserved for one cycle)" +``` + +--- + +## Task 4.5: Archive the track + +**Files:** +- Move: `conductor/tracks/test_batching_refactor_20260606/` → `conductor/tracks/archive/test_batching_refactor_20260606/` (or the archive convention used by this repo) + +- [ ] **Step 1: Check the archive convention used in this repo** + +Run: `Get-ChildItem conductor/tracks/archive_completed_tracks_20260603 | Select-Object Name -First 3` +Or: `Get-ChildItem conductor/tracks -Filter "*archive*" -Directory | Select-Object Name` + +Use the convention the project uses (likely a single archive directory or per-date directories). + +- [ ] **Step 2: Move the track directory** + +```bash +git mv conductor/tracks/test_batching_refactor_20260606 conductor/tracks/archive/test_batching_refactor_20260606 +``` + +(Adjust the destination path per Step 1.) + +- [ ] **Step 3: Update conductor/tracks.md to move the entry to "Recently Completed" or the equivalent section** + +Edit `conductor/tracks.md`: move the line for `test_batching_refactor_20260606` from the "Remaining Backlog" section to "Recently Completed Tracks (2026-06+)" or similar. Change status from `[~]` (in progress) to `[x]` (completed). + +- [ ] **Step 4: Commit** + +```bash +git add conductor/tracks.md +git mv conductor/tracks/test_batching_refactor_20260606 conductor/tracks/archive/test_batching_refactor_20260606 +git commit -m "conductor(archive): ship test_batching_refactor_20260606 to archive" +``` + +--- + +## Task 4.6: Phase 4 checkpoint (final) + +**Files:** none (commit + note only) + +- [ ] **Step 1: Run the full new script and confirm zero STRICT audit violations and all tests still pass** + +Run: +```bash +python scripts/run_tests_batched.py --audit --strict; echo "audit exit: $?" +python scripts/run_tests_batched.py --tiers 1,2 2>&1 | tail -10 +``` + +Expected: audit exits 0; tier-1 and tier-2 batches PASS or have only the same pre-existing failures as the baseline. + +- [ ] **Step 2: Create the final checkpoint commit and git note** + +```bash +git add -A +if ! git diff --cached --quiet; then git commit -m "conductor(checkpoint): Phase 4 complete - track shipped to archive"; fi +SHA=$(git log -1 --format="%H") +git notes add -m "Phase 4 checkpoint (TRACK COMPLETE): test_batching_refactor_20260606 + +- tests/test_categories.toml populated with ~30 cross-cutting entries +- .test_durations.json added to .gitignore +- scripts/run_tests_batched.py.legacy deleted +- Track archived to conductor/tracks/archive/ + +Final state: scripts/run_tests_batched.py is the new default; --plan/--audit +modes work; --tiers filters by tier; --include-opt-in gates opt-in tests; +--durations records developer-local timing cache. live_gui tests in ONE +pytest invocation (15s startup amortized). pytest-xdist for unit tier. + +No regressions in 273+ existing tests." "$SHA" +``` + +- [ ] **Step 3: Update state.toml final status** + +Edit `state.toml`: +- `current_phase = 4` (or remove this field) +- All phase_N entries: `status = "completed"`, `checkpoint_sha` filled in +- Add a final note at the bottom: `# Track completed 2026-06-06 and archived.` + +```bash +git add conductor/tracks/test_batching_refactor_20260606/state.toml +git commit -m "conductor(plan): mark Phase 4 complete in test_batching_refactor_20260606" +``` + +--- + +# Self-Review + +**1. Spec coverage:** +- Section 1 (Problem Statement) — addressed by Task 1.14 / 3.1 (replace alphabetical with categorized). +- Section 2 (Goals) — B (process isolation): Task 1.10 / 3.1 (one batch per tier). A (subsystem grouping): Task 1.5 (subsystem inference) + 1.10 (batch_group). C (xdist + session reuse): Task 1.10 + 3.1. +- Section 3 (Architecture) — 3-tier model: Task 1.10 (`plan()`). Registry: Task 1.7. Auto-inference: Tasks 1.3-1.5. +- Section 4 (Components) — categorizer: Tasks 1.1-1.7. batcher: Task 1.10. CLI orchestrator: Tasks 1.14 + 3.1. plugin: Tasks 1.11-1.13. +- Section 5 (Output) — implemented in Task 3.1 (`_print_summary`). +- Section 6 (CLI) — `--tiers`, `--include-opt-in`, `--plan`, `--audit`, `--strict`, `--no-xdist` all wired in Tasks 1.14 + 3.1. +- Section 7 (Config) — `pyproject.toml` markers (already correct; no change). `.test_durations.json`: Task 4.3. +- Section 8 (Migration) — all 4 phases mapped to plan phases. +- Section 9 (Risks) — auto-inference misclassification: covered by `--audit --strict` (Task 1.14). Tier-3 crash: `--maxfail=1` (Task 1.10). xdist non-determinism: noted in Task 3.1's verification. New tests unclassified: `--audit` (Task 1.14). +- Section 10 (Open Questions) — Q1: registry in `tests/` (Task 4.2). Q2: batch_group inferred by default (Task 1.5). Q3: deferred. Q4: per-run by default (Task 3.1). + +**2. Placeholder scan:** No "TBD", "TODO", "implement later". One intentional "[Engineer fills in ~20 more entries]" in Task 4.2 with a clear pattern to follow. + +**3. Type consistency:** `FixtureClass`, `Speed`, `CategoryRecord` defined in Task 1.1; used in all subsequent categorizer tests and in `plan()` (Task 1.10). `Batch` defined in Task 1.10; used in Task 3.1's `_run_batch` and `_print_summary`. `sort_items_by_order` signature stable across Tasks 1.11 and 1.12. + +No issues found. Plan ready for execution.