diff --git a/conductor/tracks/test_batching_refactor_20260606/plan.md b/conductor/tracks/test_batching_refactor_20260606/plan.md
new file mode 100644
index 00000000..8b68deee
--- /dev/null
+++ b/conductor/tracks/test_batching_refactor_20260606/plan.md
@@ -0,0 +1,1756 @@
+# Test Batching Refactor — Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Replace the alphabetical, 4-at-a-time test batching in `scripts/run_tests_batched.py` with a 6-tier model that isolates tests by fixture class, groups them by subsystem, and accelerates the unit tier with pytest-xdist. Classification is hybrid: auto-inferred per file, with registry overrides. Per-test ordering is opt-in via a conftest plugin.
+
+**Architecture:** Three new library modules under `scripts/`: `test_categorizer.py` (pure classifier), `test_batcher.py` (pure scheduler), `pytest_collection_order.py` (conftest-loaded plugin for opt-in per-test order). One CLI orchestrator replacing the current `run_tests_batched.py` (slim, delegates everything). One hand-curated registry `tests/test_categories.toml` for cross-cutting and ambiguous files. Process isolation enforced by spawning one pytest subprocess per `(tier, batch_group)` pair.
+
+**Tech Stack:** Python 3.11+ (stdlib `tomllib`, `re`, `pathlib`, `dataclasses`, `enum`, `ast`), pytest, pytest-xdist (optional; pass `--no-xdist` to skip). `tomli-w` is not needed because the registry is only read. **1-space indentation mandatory.** No comments in production code.
+
+**Reference:** See `conductor/tracks/test_batching_refactor_20260606/spec.md` for the full design, data model, auto-inference rules, and CLI surface.
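+
+The registry shape is only implied by the tests later in this plan, so a minimal sketch may help. It assumes the layout those tests use (`[files.<stem>]` tables plus optional `[[files.<stem>.test_order]]` entries). The file name `test_example_widget` and its values are hypothetical, and `spec.md` remains authoritative.
+
+```python
+import tomllib
+
+# Hypothetical registry content; field names mirror those used by the tests in this plan.
+REGISTRY_TEXT = """
+[files.test_example_widget]
+fixture_class = "mock_app"
+subsystems = ["gui", "dag"]
+batch_group = "gui"
+
+[[files.test_example_widget.test_order]]
+test_id = "test_example_widget::test_first"
+order = 1
+"""
+
+data = tomllib.loads(REGISTRY_TEXT)
+files = data["files"]
+# Entries are keyed by file stem, not by "name.py".
+entry = files["test_example_widget"]
+# Flatten the array of tables into a test_id -> order mapping.
+order = {item["test_id"]: item["order"] for item in entry["test_order"]}
+print(sorted(files), entry["fixture_class"], order)
+```
+
+Keying by stem sidesteps quoting: an unquoted TOML key such as `test_x.py` would be parsed as the dotted path `test_x` then `py`.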
+
+---
+
+## File Structure
+
+| File | Action | Responsibility |
+|---|---|---|
+| `scripts/test_categorizer.py` | Create | Pure classifier: enums, dataclass, `auto_classify`, `load_registry`, `merge_registry`, `categorize_all` |
+| `scripts/test_batcher.py` | Create | Pure scheduler: `Batch` dataclass, `plan(records, options) -> list[Batch]` |
+| `scripts/pytest_collection_order.py` | Create | Conftest plugin: `pytest_collection_modifyitems` hook sorts items by `order` index from registry |
+| `scripts/run_tests_batched.py` | Modify (Phase 1: stub `--plan`/`--audit`; Phase 3: full CLI) | CLI orchestrator: parse args, call `categorize_all`, dispatch to `plan` + `subprocess.run` |
+| `tests/test_categorizer.py` | Create | Unit tests for the categorizer (11+ tests) |
+| `tests/test_batcher.py` | Create | Unit tests for the batcher (5+ tests) |
+| `tests/test_pytest_collection_order.py` | Create | Unit tests for the plugin (2+ tests) |
+| `tests/test_categories.toml` | Create (Phase 4 content) | Hand-curated registry; empty/optional in Phases 1-3 |
+| `tests/conftest.py` | Modify | Register `scripts.pytest_collection_order` via `pytest_plugins` |
+| `docs/guide_testing.md` | Modify (Phase 3) | Update "Running Tests" section to reference new script |
+| `pyproject.toml` | Modify (Phase 4) | No structural change; verify markers list |
+| `.gitignore` | Modify (Phase 4) | Add `tests/.test_durations.json` |
+| `scripts/run_tests_batched.py.legacy` | Create (Phase 3) → Delete (Phase 4) | Old script preserved for one cycle |
+
+---
+
+# Phase 1: Library + dry-run
+
+> Goal: All new library code exists with passing tests. The new `run_tests_batched.py` only has `--plan` and `--audit` modes (no actual test execution). The old `run_tests_batched.py` is untouched. The conftest plugin is wired and is a no-op (no `[[test_order]]` entries exist yet).
+
+---
+
+## Task 1.1: Add data model types to scripts/test_categorizer.py
+
+**Files:**
+- Create: `scripts/test_categorizer.py`
+
+- [ ] **Step 1: Create the file with enums and CategoryRecord dataclass**
+
+```python
+from dataclasses import dataclass, field
+from enum import Enum
+
+class FixtureClass(str, Enum):
+ UNIT = "unit"
+ MOCK_APP = "mock_app"
+ LIVE_GUI = "live_gui"
+ HEADLESS = "headless"
+ OPT_IN = "opt_in"
+ PERFORMANCE = "performance"
+
+class Speed(str, Enum):
+ FAST = "fast"
+ MEDIUM = "medium"
+ SLOW = "slow"
+ VERY_SLOW = "very_slow"
+
+@dataclass(frozen=True)
+class CategoryRecord:
+ filename: str
+ fixture_class: FixtureClass
+ subsystems: list[str]
+ speed: Speed
+ batch_group: str
+ notes: str = ""
+ test_order: dict[str, int] = field(default_factory=dict)
+ source: str = "auto"
+ warnings: list[str] = field(default_factory=list)
+```
+
+- [ ] **Step 2: Verify the file is importable**
+
+Run: `uv run python -c "from scripts.test_categorizer import CategoryRecord, FixtureClass, Speed; print(CategoryRecord(filename='x', fixture_class=FixtureClass.UNIT, subsystems=['core'], speed=Speed.FAST, batch_group='core'))"`
+Expected: prints a `CategoryRecord(filename='x', ...)` line with no errors.
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add scripts/test_categorizer.py
+git commit -m "feat(scripts): add CategoryRecord data model for test categorization"
+```
+
+---
+
+## Task 1.2: Write red tests for auto_classify fixture_class rules
+
+**Files:**
+- Create: `tests/test_categorizer.py`
+
+- [ ] **Step 1: Create test file with 6 failing tests covering fixture_class auto-inference**
+
+```python
+from pathlib import Path
+import pytest
+from scripts.test_categorizer import FixtureClass, Speed, auto_classify
+
+def _write(tmp_path: Path, name: str, content: str) -> Path:
+ p = tmp_path / name
+ p.write_text(content, encoding="utf-8")
+ return p
+
+def test_auto_classify_clean_install_filename(tmp_path: Path) -> None:
+ p = _write(tmp_path, "test_clean_install.py", "def test_x(): pass\n")
+ r = auto_classify(p)
+ assert r.fixture_class == FixtureClass.OPT_IN
+
+def test_auto_classify_docker_build_filename(tmp_path: Path) -> None:
+ p = _write(tmp_path, "test_docker_build.py", "def test_x(): pass\n")
+ r = auto_classify(p)
+ assert r.fixture_class == FixtureClass.OPT_IN
+
+def test_auto_classify_live_gui_fixture_in_source(tmp_path: Path) -> None:
+ p = _write(tmp_path, "test_x.py", "def test_x(live_gui): pass\n")
+ r = auto_classify(p)
+ assert r.fixture_class == FixtureClass.LIVE_GUI
+
+def test_auto_classify_mock_app_fixture_in_source(tmp_path: Path) -> None:
+ p = _write(tmp_path, "test_x.py", "def test_x(mock_app): pass\n")
+ r = auto_classify(p)
+ assert r.fixture_class == FixtureClass.MOCK_APP
+
+def test_auto_classify_perf_keyword_in_filename(tmp_path: Path) -> None:
+ p = _write(tmp_path, "test_xyz_stress.py", "def test_x(): pass\n")
+ r = auto_classify(p)
+ assert r.fixture_class == FixtureClass.PERFORMANCE
+
+def test_auto_classify_default_to_unit(tmp_path: Path) -> None:
+ p = _write(tmp_path, "test_command_palette.py", "def test_x(): pass\n")
+ r = auto_classify(p)
+ assert r.fixture_class == FixtureClass.UNIT
+```
+
+- [ ] **Step 2: Run the tests, confirm the run is red**
+
+Run: `uv run pytest tests/test_categorizer.py -v`
+Expected: collection errors with `ImportError: cannot import name 'auto_classify' from 'scripts.test_categorizer'`, so none of the 6 tests run.
+
+- [ ] **Step 3: Commit (red)**
+
+```bash
+git add tests/test_categorizer.py
+git commit -m "test(categorizer): add red tests for auto_classify fixture_class rules"
+```
+
+---
+
+## Task 1.3: Implement auto_classify fixture_class rules
+
+**Files:**
+- Modify: `scripts/test_categorizer.py`
+
+- [ ] **Step 1: Add the auto_classify function with fixture_class rules**
+
+Append to `scripts/test_categorizer.py`:
+
+```python
+import re
+from pathlib import Path
+
+_OPT_IN_PATTERN = re.compile(r"^test_(clean_install|docker_build)")
+_LIVE_GUI_PATTERN = re.compile(r"[(,]\s*live_gui\s*[:,)=]")
+_MOCK_APP_PATTERN = re.compile(r"\b(mock_app|app_instance)\b")
+_PERF_KEYWORDS = ("perf", "stress", "phase_3_final", "phase_4_stress")
+
+def _classify_fixture_class(path: Path, source: str) -> FixtureClass:
+ name = path.name
+ if _OPT_IN_PATTERN.match(name):
+ return FixtureClass.OPT_IN
+ if _LIVE_GUI_PATTERN.search(source):
+ return FixtureClass.LIVE_GUI
+ if _MOCK_APP_PATTERN.search(source):
+ return FixtureClass.MOCK_APP
+ lowered = name.lower()
+ for kw in _PERF_KEYWORDS:
+ if kw in lowered:
+ return FixtureClass.PERFORMANCE
+ return FixtureClass.UNIT
+
+def auto_classify(path: Path, durations: dict[str, float] | None = None) -> CategoryRecord:
+ source = path.read_text(encoding="utf-8", errors="replace")
+ fixture_class = _classify_fixture_class(path, source)
+ return CategoryRecord(
+ filename=path.name,
+ fixture_class=fixture_class,
+ subsystems=[],
+ speed=Speed.MEDIUM,
+ batch_group="",
+ source="auto",
+ )
+```
+
+- [ ] **Step 2: Run the 6 tests, confirm they pass**
+
+Run: `uv run pytest tests/test_categorizer.py -v`
+Expected: All 6 tests PASS.
+
+- [ ] **Step 3: Commit (green)**
+
+```bash
+git add scripts/test_categorizer.py
+git commit -m "feat(categorizer): implement auto_classify fixture_class rules"
+```
+
+---
+
+## Task 1.4: Write red tests for subsystem, speed, and batch_group inference
+
+**Files:**
+- Modify: `tests/test_categorizer.py`
+
+- [ ] **Step 1: Append 4 tests for the new inference functions (3 fail against the current stub; the default-speed test already passes)**
+
+```python
+def test_subsystem_inference_known_prefix(tmp_path: Path) -> None:
+ p = _write(tmp_path, "test_mcp_client_foo.py", "def test_x(): pass\n")
+ r = auto_classify(p)
+ assert "mcp" in r.subsystems
+
+def test_speed_inference_from_durations_fast(tmp_path: Path) -> None:
+ p = _write(tmp_path, "test_x.py", "def test_x(): pass\n")
+ durations = {f"{p.name}::test_x": 0.05}
+ r = auto_classify(p, durations=durations)
+ assert r.speed == Speed.FAST
+
+def test_speed_default_medium_without_durations(tmp_path: Path) -> None:
+ p = _write(tmp_path, "test_x.py", "def test_x(): pass\n")
+ r = auto_classify(p)
+ assert r.speed == Speed.MEDIUM
+
+def test_batch_group_inference_gui_subsystem(tmp_path: Path) -> None:
+ p = _write(tmp_path, "test_gui_layout_foo.py", "def test_x(): pass\n")
+ r = auto_classify(p)
+ assert r.batch_group == "gui"
+```
+
+- [ ] **Step 2: Run, confirm 3 new tests fail (and previous 6 still pass)**
+
+Run: `uv run pytest tests/test_categorizer.py -v`
+Expected: 3 new tests FAIL with `AssertionError` on `subsystems`/`speed`/`batch_group` (which are currently empty/default). `test_speed_default_medium_without_durations` already passes because the stub hard-codes `Speed.MEDIUM`.
+
+- [ ] **Step 3: Commit (red)**
+
+```bash
+git add tests/test_categorizer.py
+git commit -m "test(categorizer): add red tests for subsystem/speed/batch_group inference"
+```
+
+---
+
+## Task 1.5: Implement subsystem, speed, and batch_group inference
+
+**Files:**
+- Modify: `scripts/test_categorizer.py`
+
+- [ ] **Step 1: Add the constants and inference functions, and replace the existing `auto_classify` with the version below**
+
+```python
+_SUBSYSTEM_PREFIXES = (
+ "ai", "api", "arch", "ast", "async", "auto", "beads", "bias", "cache",
+ "cli", "cmd", "comms", "conductor", "context", "cost", "dag", "deepseek",
+ "diff", "discussion", "event", "execution", "external", "ext", "fuzzy",
+ "gemini", "gui", "headless", "history", "hooks", "hot", "imgui", "layout",
+ "live", "log", "mcp", "markdown", "minimax", "mma", "model", "orchestrator",
+ "outline", "parallel", "patch", "perf", "persona", "phase", "pipeline",
+ "preset", "prior", "process", "project", "provider", "rag", "script",
+ "session", "shader", "sim", "skeleton", "slice", "spawn", "status",
+ "subagent", "summary", "symbol", "sync", "synthesis", "system", "takes",
+ "theme", "thinking", "ticket", "tier4", "tiered", "token", "tool", "track",
+ "tree", "ts", "undo", "usage", "user", "vendor", "view", "visual",
+ "vlogger", "websocket", "workflow", "workspace", "z",
+)
+
+_BATCH_GROUP_CLUSTERS: dict[str, tuple[str, ...]] = {
+ "core": (
+ "mcp", "ai", "context", "api", "dag", "path", "presets", "personas",
+ "history", "workspace", "rag", "beads", "model", "ast", "async", "cache",
+ "cli", "cmd", "fuzzy", "hooks", "log", "markdown", "orchestrator",
+ "outline", "pipeline", "project", "provider", "script", "session",
+ "skeleton", "slice", "spawn", "status", "subagent", "summary", "symbol",
+ "sync", "synthesis", "system", "takes", "thinking", "tier4", "tiered",
+ "tool", "track", "tree", "ts", "usage", "vendor", "vlogger", "websocket",
+ "workflow",
+ ),
+ "gui": ("gui", "theme", "imgui", "layout", "live", "prior", "visual", "view", "undo"),
+ "mma": ("mma", "conductor", "execution", "ext", "external", "auto", "manual", "tier", "arch", "phase", "process", "z"),
+ "comms": ("comms", "diff", "patch", "event", "hot", "process", "shader"),
+ "headless": ("headless",),
+}
+
+def _infer_subsystems(filename: str) -> list[str]:
+ stem = filename.removeprefix("test_").removesuffix(".py")
+ for prefix in sorted(_SUBSYSTEM_PREFIXES, key=len, reverse=True):
+ if stem.startswith(prefix + "_") or stem == prefix:
+ return [prefix]
+ return []
+
+def _infer_batch_group(subsystems: list[str]) -> str:
+ if not subsystems:
+ return "core"
+ first = subsystems[0]
+ for group, members in _BATCH_GROUP_CLUSTERS.items():
+ if first in members:
+ return group
+ return "core"
+
+def _infer_speed(filename: str, durations: dict[str, float] | None) -> Speed:
+ if not durations:
+ return Speed.MEDIUM
+ matching = [v for k, v in durations.items() if k.startswith(filename + "::")]
+ if not matching:
+ return Speed.MEDIUM
+ p95 = sorted(matching)[int(len(matching) * 0.95)]
+ if p95 < 1.0:
+ return Speed.FAST
+ if p95 < 5.0:
+ return Speed.MEDIUM
+ if p95 < 30.0:
+ return Speed.SLOW
+ return Speed.VERY_SLOW
+
+def auto_classify(path: Path, durations: dict[str, float] | None = None) -> CategoryRecord:
+ source = path.read_text(encoding="utf-8", errors="replace")
+ fixture_class = _classify_fixture_class(path, source)
+ subsystems = _infer_subsystems(path.name)
+ speed = _infer_speed(path.name, durations)
+ batch_group = _infer_batch_group(subsystems)
+ return CategoryRecord(
+ filename=path.name,
+ fixture_class=fixture_class,
+ subsystems=subsystems,
+ speed=speed,
+ batch_group=batch_group,
+ source="auto",
+ )
+```
+
+- [ ] **Step 2: Run all 10 tests, confirm they pass**
+
+Run: `uv run pytest tests/test_categorizer.py -v`
+Expected: All 10 tests PASS.
+
+- [ ] **Step 3: Commit (green)**
+
+```bash
+git add scripts/test_categorizer.py
+git commit -m "feat(categorizer): implement subsystem/speed/batch_group inference"
+```
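+
+For intuition about the speed buckets above, here is a small standalone sketch of the same index rule (`int(n * 0.95)`) and thresholds. The helper names `p95` and `bucket` are illustrative only and are not part of `scripts/test_categorizer.py`.
+
+```python
+def p95(values: list[float]) -> float:
+ # Same index rule as _infer_speed: int(n * 0.95) into the sorted list.
+ return sorted(values)[int(len(values) * 0.95)]
+
+def bucket(seconds: float) -> str:
+ # Thresholds copied from _infer_speed.
+ if seconds < 1.0:
+  return "fast"
+ if seconds < 5.0:
+  return "medium"
+ if seconds < 30.0:
+  return "slow"
+ return "very_slow"
+
+# With 10 samples the index is 9, so a single slow outlier decides the bucket.
+few = [0.1] * 9 + [40.0]
+# With 21 samples the index is 19, so one outlier is ignored.
+many = [0.1] * 20 + [40.0]
+print(bucket(p95(few)), bucket(p95(many)))
+```
+
+For files with fewer than 20 recorded tests the index lands on the maximum, so one slow test sets the whole file's speed. That may be acceptable for scheduling estimates, but it is worth knowing when reading `--audit` output.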
+
+---
+
+## Task 1.6: Write red tests for merge_registry and categorize_all
+
+**Files:**
+- Modify: `tests/test_categorizer.py`
+
+- [ ] **Step 1: Append 3 failing tests for registry merge and full classification**
+
+```python
+import tomllib
+from scripts.test_categorizer import load_registry, merge_registry, categorize_all
+
+def test_load_registry_returns_dict(tmp_path: Path) -> None:
+ toml = tmp_path / "reg.toml"
+ toml.write_text('[files.test_x]\nfixture_class = "mock_app"\n', encoding="utf-8")
+ reg = load_registry(toml)
+ assert "test_x" in reg
+ assert reg["test_x"]["fixture_class"] == "mock_app"
+
+def test_merge_registry_overrides_auto(tmp_path: Path) -> None:
+ p = _write(tmp_path, "test_x.py", "def test_x(): pass\n")
+ auto = auto_classify(p)
+ assert auto.fixture_class == FixtureClass.UNIT
+ reg_entry = {"fixture_class": "mock_app", "subsystems": ["x"], "speed": "fast", "batch_group": "x"}
+ merged = merge_registry(auto, reg_entry)
+ assert merged.fixture_class == FixtureClass.MOCK_APP
+ assert merged.source == "registry"
+ assert "subsystems-override" in " ".join(merged.warnings) or merged.subsystems == ["x"]
+
+def test_categorize_all_handles_real_tests_dir(tmp_path: Path) -> None:
+ (tmp_path / "test_a.py").write_text("def test_x(): pass\n", encoding="utf-8")
+ (tmp_path / "test_b_sim.py").write_text("def test_x(live_gui): pass\n", encoding="utf-8")
+ reg_path = tmp_path / "reg.toml"
+ reg_path.write_text("", encoding="utf-8")
+ records = categorize_all(tmp_path, reg_path)
+ assert len(records) == 2
+ by_name = {r.filename: r for r in records}
+ assert by_name["test_a.py"].fixture_class == FixtureClass.UNIT
+ assert by_name["test_b_sim.py"].fixture_class == FixtureClass.LIVE_GUI
+```
+
+- [ ] **Step 2: Run, confirm the run is red**
+
+Run: `uv run pytest tests/test_categorizer.py -v`
+Expected: collection of `tests/test_categorizer.py` errors with `ImportError: cannot import name 'load_registry' from 'scripts.test_categorizer'`. Because the new import is module-level, none of the 13 tests run until Task 1.7 lands.
+
+- [ ] **Step 3: Commit (red)**
+
+```bash
+git add tests/test_categorizer.py
+git commit -m "test(categorizer): add red tests for registry merge and full classification"
+```
+
+---
+
+## Task 1.7: Implement load_registry, merge_registry, categorize_all
+
+**Files:**
+- Modify: `scripts/test_categorizer.py`
+
+- [ ] **Step 1: Add the three new functions**
+
+```python
+import tomllib
+
+def load_registry(toml_path: Path) -> dict[str, dict]:
+ if not toml_path.exists():
+ return {}
+ with toml_path.open("rb") as f:
+ data = tomllib.load(f)
+ return data.get("files", {})
+
+def merge_registry(auto: CategoryRecord, entry: dict) -> CategoryRecord:
+ warnings = list(auto.warnings)
+ if "fixture_class" in entry and entry["fixture_class"] != auto.fixture_class.value:
+ warnings.append(f"fixture_class-override: {auto.fixture_class.value} -> {entry['fixture_class']}")
+ if "subsystems" in entry and set(entry["subsystems"]) != set(auto.subsystems):
+ warnings.append(f"subsystems-override: {auto.subsystems} -> {entry['subsystems']}")
+ return CategoryRecord(
+ filename=auto.filename,
+ fixture_class=FixtureClass(entry.get("fixture_class", auto.fixture_class.value)),
+ subsystems=list(entry.get("subsystems", auto.subsystems)),
+ speed=Speed(entry.get("speed", auto.speed.value)),
+ batch_group=entry.get("batch_group", auto.batch_group),
+ notes=entry.get("notes", auto.notes),
+ test_order=dict(auto.test_order),
+ source="registry",
+ warnings=warnings,
+ )
+
+def categorize_all(tests_dir: Path, registry_path: Path) -> list[CategoryRecord]:
+ registry = load_registry(registry_path)
+ records: list[CategoryRecord] = []
+ for path in sorted(tests_dir.glob("test_*.py")):
+ auto = auto_classify(path)
+  entry = registry.get(path.stem, {})
+ if entry:
+ records.append(merge_registry(auto, entry))
+ else:
+ records.append(auto)
+ return records
+```
+
+- [ ] **Step 2: Run all 13 tests, confirm they pass**
+
+Run: `uv run pytest tests/test_categorizer.py -v`
+Expected: All 13 tests PASS.
+
+- [ ] **Step 3: Commit (green)**
+
+```bash
+git add scripts/test_categorizer.py
+git commit -m "feat(categorizer): implement load_registry, merge_registry, categorize_all"
+```
+
+---
+
+## Task 1.8: Smoke-test categorize_all on the real tests/ directory
+
+**Files:**
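+- None (the smoke test is a one-shot `python -c` invocation; `tests/test_categorizer_smoke.py` is only a scratch file if you choose to create one, and it is deleted in this task)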
+
+- [ ] **Step 1: Run the categorizer against the real tests/ dir and inspect output**
+
+Run:
+```bash
+uv run python -c "
+from pathlib import Path
+from scripts.test_categorizer import categorize_all, FixtureClass
+records = categorize_all(Path('tests'), Path('tests/test_categories.toml'))
+print(f'Total: {len(records)}')
+from collections import Counter
+fc_counts = Counter(r.fixture_class for r in records)
+for fc, n in sorted(fc_counts.items(), key=lambda x: x[0].value):
+ print(f'  {fc.value}: {n}')
+print('Subsystem distribution (top 10):')
+sub_counts = Counter()
+for r in records:
+ for s in r.subsystems:
+ sub_counts[s] += 1
+for s, n in sub_counts.most_common(10):
+ print(f'  {s}: {n}')
+"
+```
+
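+Expected: prints `Total: 277` (or current count) and a fixture-class breakdown. Confirm:
+- `opt_in: 2` (test_clean_install + test_docker_build)
+- `live_gui: 14` (all `*_sim.py` files)
+- `unit: ~200+` (the rest)
+- A few `mock_app` (files that reference mock_app/app_instance)
+- One expected false positive: `tests/test_categorizer.py` embeds `(live_gui)` and `mock_app` inside string literals, so the source regexes classify it as `live_gui`. Note it for a registry override in Phase 4.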
+
+- [ ] **Step 2: Sanity-check a few specific files**
+
+Run:
+```bash
+uv run python -c "
+from pathlib import Path
+from scripts.test_categorizer import categorize_all
+records = {r.filename: r for r in categorize_all(Path('tests'), Path('tests/test_categories.toml'))}
+for f in ['test_gui_dag_beads.py', 'test_arch_boundary_phase1.py', 'test_mcp_client.py', 'test_command_palette.py']:
+ r = records[f]
+ print(f'{f}: fc={r.fixture_class.value}, subs={r.subsystems}, bg={r.batch_group}, speed={r.speed.value}')
+"
+```
+
+Expected: `test_mcp_client.py` → `fc=unit, subs=['mcp'], bg=core`. Other files have sensible values.
+
+- [ ] **Step 3: Delete the smoke-test scratch file (if any was created) and commit nothing**
+
+No commit. The smoke test was a one-shot `python -c` invocation.
+
+---
+
+## Task 1.9: Write red tests for scripts/test_batcher.py::plan
+
+**Files:**
+- Create: `tests/test_batcher.py`
+
+- [ ] **Step 1: Create test file with 5 failing tests for the plan function**
+
+```python
+from pathlib import Path
+import pytest
+from scripts.test_categorizer import CategoryRecord, FixtureClass, Speed
+from scripts.test_batcher import Batch, plan
+
+def _rec(name: str, fc: FixtureClass, bg: str = "core") -> CategoryRecord:
+ return CategoryRecord(
+ filename=name,
+ fixture_class=fc,
+ subsystems=[bg] if bg else [],
+ speed=Speed.MEDIUM,
+ batch_group=bg,
+ )
+
+def test_plan_groups_unit_by_batch_group() -> None:
+ records = [
+ _rec("test_a.py", FixtureClass.UNIT, "core"),
+ _rec("test_b.py", FixtureClass.UNIT, "gui"),
+ _rec("test_c.py", FixtureClass.UNIT, "core"),
+ ]
+ batches = plan(records)
+ unit_batches = [b for b in batches if b.tier == "1"]
+ labels = {b.label for b in unit_batches}
+ assert "tier-1-unit-core" in labels
+ assert "tier-1-unit-gui" in labels
+
+def test_plan_live_gui_tier_is_one_batch() -> None:
+ records = [
+ _rec("test_a_sim.py", FixtureClass.LIVE_GUI, "gui"),
+ _rec("test_b_sim.py", FixtureClass.LIVE_GUI, "core"),
+ _rec("test_c_sim.py", FixtureClass.LIVE_GUI, "mma"),
+ ]
+ batches = plan(records)
+ live_batches = [b for b in batches if b.tier == "3"]
+ assert len(live_batches) == 1
+ assert len(live_batches[0].files) == 3
+
+def test_plan_opt_in_skipped_without_flag() -> None:
+ records = [
+ _rec("test_clean_install.py", FixtureClass.OPT_IN),
+ _rec("test_docker_build.py", FixtureClass.OPT_IN),
+ ]
+ batches = plan(records, include_opt_in=False)
+ opt_batches = [b for b in batches if b.tier == "0"]
+ assert len(opt_batches) == 2
+ assert all(b.skip_reason for b in opt_batches)
+
+def test_plan_is_deterministic() -> None:
+ records = [
+ _rec("test_a.py", FixtureClass.UNIT, "core"),
+ _rec("test_b.py", FixtureClass.UNIT, "gui"),
+ ]
+ a = plan(records)
+ b = plan(records)
+ assert [(x.tier, x.label, len(x.files)) for x in a] == [(y.tier, y.label, len(y.files)) for y in b]
+
+def test_plan_xdist_only_for_tier_1() -> None:
+ records = [_rec("test_a.py", FixtureClass.UNIT, "core")]
+ batches = plan(records, xdist=True)
+ unit = next(b for b in batches if b.tier == "1")
+ assert "-n" in unit.pytest_args
+ assert "auto" in unit.pytest_args
+ mock_records = [_rec("test_a.py", FixtureClass.MOCK_APP, "core")]
+ batches2 = plan(mock_records, xdist=True)
+ mock = next(b for b in batches2 if b.tier == "2")
+ assert "-n" not in mock.pytest_args
+```
+
+- [ ] **Step 2: Run, confirm the run is red**
+
+Run: `uv run pytest tests/test_batcher.py -v`
+Expected: collection errors with `ModuleNotFoundError: No module named 'scripts.test_batcher'`, so none of the 5 tests run.
+
+- [ ] **Step 3: Commit (red)**
+
+```bash
+git add tests/test_batcher.py
+git commit -m "test(batcher): add red tests for plan() function"
+```
+
+---
+
+## Task 1.10: Implement scripts/test_batcher.py
+
+**Files:**
+- Create: `scripts/test_batcher.py`
+
+- [ ] **Step 1: Create the file with Batch dataclass and plan function**
+
+```python
+from dataclasses import dataclass
+from pathlib import Path
+from scripts.test_categorizer import CategoryRecord, FixtureClass
+
+@dataclass(frozen=True)
+class Batch:
+ tier: str
+ label: str
+ files: list[Path]
+ pytest_args: list[str]
+ estimated_seconds: float
+ skip_reason: str | None = None
+
+_TIER_ORDER = ("0", "1", "2", "3", "H", "P")
+
+def _batches_for_unit(records: list[CategoryRecord], xdist: bool) -> list[Batch]:
+ by_group: dict[str, list[CategoryRecord]] = {}
+ for r in records:
+ by_group.setdefault(r.batch_group or "core", []).append(r)
+ batches: list[Batch] = []
+ for group in sorted(by_group):
+ files = [Path("tests") / r.filename for r in by_group[group]]
+ args: list[str] = ["--maxfail=10"]
+ if xdist:
+ args = ["-n", "auto"] + args
+ batches.append(Batch(
+ tier="1",
+ label=f"tier-1-unit-{group}",
+ files=files,
+ pytest_args=args,
+ estimated_seconds=sum(_est(r) for r in by_group[group]),
+ ))
+ return batches
+
+def _batches_for_mock_app(records: list[CategoryRecord]) -> list[Batch]:
+ by_group: dict[str, list[CategoryRecord]] = {}
+ for r in records:
+ by_group.setdefault(r.batch_group or "core", []).append(r)
+ batches: list[Batch] = []
+ for group in sorted(by_group):
+ files = [Path("tests") / r.filename for r in by_group[group]]
+ batches.append(Batch(
+ tier="2",
+ label=f"tier-2-mock_app-{group}",
+ files=files,
+ pytest_args=["--maxfail=5"],
+ estimated_seconds=sum(_est(r) for r in by_group[group]),
+ ))
+ return batches
+
+def _batches_for_live_gui(records: list[CategoryRecord]) -> list[Batch]:
+ if not records:
+ return []
+ files = [Path("tests") / r.filename for r in records]
+ return [Batch(
+ tier="3",
+ label="tier-3-live_gui",
+ files=files,
+ pytest_args=["--maxfail=1"],
+ estimated_seconds=sum(_est(r) for r in records),
+ )]
+
+def _batches_for_headless(records: list[CategoryRecord]) -> list[Batch]:
+ if not records:
+ return []
+ files = [Path("tests") / r.filename for r in records]
+ return [Batch(
+ tier="H",
+ label="tier-H-headless",
+ files=files,
+ pytest_args=["--maxfail=5"],
+ estimated_seconds=sum(_est(r) for r in records),
+ )]
+
+def _batches_for_performance(records: list[CategoryRecord]) -> list[Batch]:
+ if not records:
+ return []
+ files = [Path("tests") / r.filename for r in records]
+ return [Batch(
+ tier="P",
+ label="tier-P-performance",
+ files=files,
+ pytest_args=["--maxfail=1"],
+ estimated_seconds=sum(_est(r) for r in records),
+ )]
+
+def _batches_for_opt_in(records: list[CategoryRecord], include_opt_in: bool) -> list[Batch]:
+ batches: list[Batch] = []
+ for r in records:
+ files = [Path("tests") / r.filename]
+ skip_reason: str | None = None
+ if not include_opt_in:
+ skip_reason = "--include-opt-in not set"
+ elif r.filename.startswith("test_clean_install") and not _env_set("RUN_CLEAN_INSTALL_TEST"):
+ skip_reason = "RUN_CLEAN_INSTALL_TEST not set"
+ elif r.filename.startswith("test_docker_build") and not _env_set("RUN_DOCKER_TEST"):
+ skip_reason = "RUN_DOCKER_TEST not set"
+ batches.append(Batch(
+ tier="0",
+ label=f"tier-0-opt_in-{r.filename.removeprefix('test_').removesuffix('.py')}",
+ files=files,
+ pytest_args=["--maxfail=1"],
+ estimated_seconds=_est(r),
+ skip_reason=skip_reason,
+ ))
+ return batches
+
+import os
+def _env_set(name: str) -> bool:
+ return bool(os.environ.get(name))
+
+_SPEED_SECONDS = {"fast": 0.5, "medium": 3.0, "slow": 15.0, "very_slow": 60.0}
+def _est(r: CategoryRecord) -> float:
+ return _SPEED_SECONDS.get(r.speed.value, 3.0)
+
+def plan(
+ records: list[CategoryRecord],
+ *,
+ tiers: set[str] = set(_TIER_ORDER),
+ include_opt_in: bool = False,
+ xdist: bool = True,
+) -> list[Batch]:
+ by_fc: dict[FixtureClass, list[CategoryRecord]] = {fc: [] for fc in FixtureClass}
+ for r in records:
+ by_fc[r.fixture_class].append(r)
+ out: list[Batch] = []
+ if "0" in tiers:
+ out.extend(_batches_for_opt_in(by_fc[FixtureClass.OPT_IN], include_opt_in))
+ if "1" in tiers:
+ out.extend(_batches_for_unit(by_fc[FixtureClass.UNIT], xdist))
+ if "2" in tiers:
+ out.extend(_batches_for_mock_app(by_fc[FixtureClass.MOCK_APP]))
+ if "3" in tiers:
+ out.extend(_batches_for_live_gui(by_fc[FixtureClass.LIVE_GUI]))
+ if "H" in tiers:
+ out.extend(_batches_for_headless(by_fc[FixtureClass.HEADLESS]))
+ if "P" in tiers:
+ out.extend(_batches_for_performance(by_fc[FixtureClass.PERFORMANCE]))
+ out.sort(key=lambda b: (_TIER_ORDER.index(b.tier), b.label))
+ return out
+```
+
+- [ ] **Step 2: Run all 5 tests, confirm they pass**
+
+Run: `uv run pytest tests/test_batcher.py -v`
+Expected: All 5 tests PASS.
+
+- [ ] **Step 3: Commit (green)**
+
+```bash
+git add scripts/test_batcher.py
+git commit -m "feat(batcher): implement Batch dataclass and plan() function"
+```
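+
+Execution is out of scope for Phase 1, but the Architecture section's "one pytest subprocess per `(tier, batch_group)` pair" can be sketched as follows. `build_command` and `run_batch` are hypothetical helper names, and the `Batch` stand-in only mirrors the fields defined above so the snippet runs on its own.
+
+```python
+import subprocess
+import sys
+from dataclasses import dataclass
+from pathlib import Path
+
+@dataclass(frozen=True)
+class Batch:
+ # Stand-in with the same fields as scripts.test_batcher.Batch.
+ tier: str
+ label: str
+ files: list[Path]
+ pytest_args: list[str]
+ estimated_seconds: float
+ skip_reason: str | None = None
+
+def build_command(batch: Batch) -> list[str]:
+ # One pytest process per batch, so fixtures cannot leak between batches.
+ return [sys.executable, "-m", "pytest", *batch.pytest_args, *(str(f) for f in batch.files)]
+
+def run_batch(batch: Batch) -> int | None:
+ # Skipped batches are reported, never executed.
+ if batch.skip_reason:
+  print(f"SKIP {batch.label}: {batch.skip_reason}")
+  return None
+ return subprocess.run(build_command(batch), check=False).returncode
+```
+
+Building the command separately from running it keeps the dry-run (`--plan`) path and the real run on the same code, which is one possible way to keep the Phase 3 orchestrator slim.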
+
+---
+
+## Task 1.11: Write red tests for scripts/pytest_collection_order.py
+
+**Files:**
+- Create: `tests/test_pytest_collection_order.py`
+
+- [ ] **Step 1: Create test file with 2 failing tests for the plugin**
+
+```python
+import textwrap
+from pathlib import Path
+import pytest
+
+def test_no_op_without_registry(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
+ monkeypatch.chdir(tmp_path)
+ (tmp_path / "test_zz.py").write_text(
+ "def test_b(): pass\ndef test_a(): pass\n", encoding="utf-8"
+ )
+ from scripts.pytest_collection_order import sort_items_by_order
+ items = []
+ result = sort_items_by_order(items, registry_path=tmp_path / "reg.toml")
+ assert result == items
+
+def test_sorts_by_order_index(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
+ monkeypatch.chdir(tmp_path)
+ (tmp_path / "test_zz.py").write_text(
+ "def test_b(): pass\ndef test_a(): pass\n", encoding="utf-8"
+ )
+ reg = tmp_path / "reg.toml"
+ reg.write_text(
+ "[[files.test_zz.test_order]]\n"
+ "test_id = 'test_zz::test_b'\n"
+ "order = 1\n"
+ "[[files.test_zz.test_order]]\n"
+ "test_id = 'test_zz::test_a'\n"
+ "order = 2\n",
+ encoding="utf-8",
+ )
+ class _Item:
+ def __init__(self, nodeid: str) -> None:
+ self.nodeid = nodeid
+ def __repr__(self) -> str:
+ return f"_Item({self.nodeid!r})"
+ items = [_Item("test_zz::test_a"), _Item("test_zz::test_b")]
+ from scripts.pytest_collection_order import sort_items_by_order
+ result = sort_items_by_order(items, registry_path=reg)
+ assert [i.nodeid for i in result] == ["test_zz::test_b", "test_zz::test_a"]
+```
+
+- [ ] **Step 2: Run, confirm 2 tests fail**
+
+Run: `uv run pytest tests/test_pytest_collection_order.py -v`
+Expected: 2 tests FAIL with `ModuleNotFoundError: No module named 'scripts.pytest_collection_order'`.
+
+- [ ] **Step 3: Commit (red)**
+
+```bash
+git add tests/test_pytest_collection_order.py
+git commit -m "test(collection_order): add red tests for opt-in sort_items_by_order"
+```
+
+---
+
+## Task 1.12: Implement scripts/pytest_collection_order.py
+
+**Files:**
+- Create: `scripts/pytest_collection_order.py`
+
+- [ ] **Step 1: Create the file with the sort function and pytest hook**
+
+```python
+from pathlib import Path
+import tomllib
+
+def _load_order_map(registry_path: Path) -> dict[str, dict[str, int]]:
+ if not registry_path.exists():
+ return {}
+ with registry_path.open("rb") as f:
+ data = tomllib.load(f)
+ files = data.get("files", {})
+ out: dict[str, dict[str, int]] = {}
+ for fname, entry in files.items():
+ order_list = entry.get("test_order", [])
+ if isinstance(order_list, list):
+ out[fname] = {item["test_id"]: int(item["order"]) for item in order_list}
+ elif isinstance(order_list, dict):
+ out[fname] = {k: int(v) for k, v in order_list.items()}
+ return out
+
+def _registry_key(nodeid: str) -> tuple[str, str]:
+ file_part, _, rest = nodeid.partition("::")
+ stem = Path(file_part).stem
+ return stem, f"{stem}::{rest}"
+
+def sort_items_by_order(items: list, registry_path: Path) -> list:
+ order_map = _load_order_map(registry_path)
+ if not order_map:
+  return list(items)
+ by_file: dict[str, list] = {}
+ for it in items:
+  stem, _ = _registry_key(getattr(it, "nodeid", ""))
+  by_file.setdefault(stem, []).append(it)
+ out: list = []
+ for stem, group in by_file.items():
+  fmap = order_map.get(stem)
+  if not fmap:
+   out.extend(group)
+   continue
+  def _key(pair: tuple[int, object]) -> tuple[int, int]:
+   pos, it = pair
+   _, test_id = _registry_key(getattr(it, "nodeid", ""))
+   return (fmap.get(test_id, 1 << 30), pos)
+  out.extend(it for _, it in sorted(enumerate(group), key=_key))
+ return out
+
+def pytest_collection_modifyitems(config, items) -> None:
+ root = Path(getattr(config, "rootpath", Path.cwd()))
+ registry_path = root / "tests" / "test_categories.toml"
+ new_items = sort_items_by_order(list(items), registry_path=registry_path)
+ items[:] = new_items
+```
+
+- [ ] **Step 2: Run, confirm 2 tests pass**
+
+Run: `uv run pytest tests/test_pytest_collection_order.py -v`
+Expected: 2 tests PASS.
+
+- [ ] **Step 3: Commit (green)**
+
+```bash
+git add scripts/pytest_collection_order.py
+git commit -m "feat(collection_order): implement opt-in per-test sort via conftest hook"
+```
+
+---
+
+## Task 1.13: Wire the plugin in tests/conftest.py
+
+**Files:**
+- Modify: `tests/conftest.py:1-30` (read first; this file is 250+ lines)
+
+- [ ] **Step 1: Read the first 30 lines of conftest.py to understand structure**
+
+Read the first 30 lines of `tests/conftest.py` via the MCP tool `manual-slop_get_file_slice path=tests/conftest.py start_line=1 end_line=30`. Identify where module-level imports and `pytest_plugins` are declared.
+
+- [ ] **Step 2: Add `pytest_plugins` line (or extend if it exists)**
+
+If `tests/conftest.py` does NOT already have a `pytest_plugins` line, ADD this line near the top (after imports, before any fixtures):
+
+```python
+pytest_plugins = ["scripts.pytest_collection_order"]
+```
+
+If `pytest_plugins` already exists, APPEND `"scripts.pytest_collection_order"` to the list.
+
+Use the surgical edit tool (`manual-slop_edit_file` with the appropriate `old_string`/`new_string`) to make this change with **1-space indentation** and no comments.
+
+- [ ] **Step 3: Run the full test suite to confirm no regressions**
+
+Run: `uv run pytest tests/ --ignore=tests/test_categorizer_smoke.py -x -q --timeout=60 2>&1 | tail -30`
+Expected: most tests pass; the new categorizer + batcher + plugin tests pass; pre-existing failures (if any) match the known baseline from `conductor/tracks/regression_fixes_20260605/`.
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add tests/conftest.py
+git commit -m "test(conftest): register scripts.pytest_collection_order as pytest plugin"
+```
+
+---
+
+## Task 1.14: Implement scripts/run_tests_batched.py with --plan and --audit modes only
+
+**Files:**
+- Modify: `scripts/run_tests_batched.py` (replace contents)
+
+- [ ] **Step 1: Replace the file with the new orchestrator (Phase 1 stub: no actual pytest execution)**
+
+```python
+import argparse
+import sys
+from pathlib import Path
+
+from scripts.test_categorizer import categorize_all
+from scripts.test_batcher import plan
+
+def _print_plan(records, options) -> int:
+ batches = plan(records, include_opt_in=options.include_opt_in, xdist=not options.no_xdist)
+ for b in batches:
+  status = "SKIP" if b.skip_reason else "RUN"
+  print(f"[{status}] {b.label}: {len(b.files)} files, est {b.estimated_seconds:.1f}s, args={b.pytest_args}")
+  if b.skip_reason:
+   print(f"    reason: {b.skip_reason}")
+ return 0
+
+def _print_audit(records, strict: bool) -> int:
+ auto = [r for r in records if r.source == "auto"]
+ print(f"Auto-inferred (unclassified) records: {len(auto)}")
+ for r in auto:
+  print(f" {r.filename}: fc={r.fixture_class.value}, subs={r.subsystems}, bg={r.batch_group}")
+ if strict:
+  bad = [r for r in auto if len(r.subsystems) > 1]
+  if bad:
+   print(f"STRICT: {len(bad)} auto-inferred files have multiple subsystems (probably cross-cutting):")
+   for r in bad:
+    print(f" {r.filename}: subs={r.subsystems}")
+   return 1
+ return 0
+
+def main() -> int:
+ p = argparse.ArgumentParser()
+ p.add_argument("--tests-dir", default="tests")
+ p.add_argument("--registry", default="tests/test_categories.toml")
+ p.add_argument("--tiers", default="1,2,3,H")
+ p.add_argument("--include-opt-in", action="store_true")
+ p.add_argument("--no-xdist", action="store_true")
+ p.add_argument("--plan", action="store_true")
+ p.add_argument("--audit", action="store_true")
+ p.add_argument("--strict", action="store_true")
+ options = p.parse_args()
+ records = categorize_all(Path(options.tests_dir), Path(options.registry))
+ if options.audit:
+  return _print_audit(records, strict=options.strict)
+ if options.plan:
+  return _print_plan(records, options)
+ print("Phase 1 stub: no actual test execution yet. Use --plan or --audit.")
+ return 0
+
+if __name__ == "__main__":
+ sys.exit(main())
+```
+
+- [ ] **Step 2: Verify --plan output**
+
+Run: `python scripts/run_tests_batched.py --plan 2>&1 | head -30`
+Expected: prints a list of batches with labels like `[RUN] tier-1-unit-core: 42 files, est 126.0s, args=['-n', 'auto', '--maxfail=10']` etc.
+
+- [ ] **Step 3: Verify --audit output**
+
+Run: `python scripts/run_tests_batched.py --audit 2>&1 | head -20`
+Expected: prints `Auto-inferred (unclassified) records: 275` (or similar; all files are auto-inferred because the registry is empty) followed by per-file lines.
+
+- [ ] **Step 4: Verify --audit --strict exits non-zero (when cross-cutting auto-classification occurs)**
+
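+Run (bash): `python scripts/run_tests_batched.py --audit --strict 2>&1 | tail -5; echo "exit: ${PIPESTATUS[0]}"` (after a pipeline, plain `$?` reports the exit status of `tail`, not of the script).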
+Expected: prints STRICT violations (if any auto-inferred file has multiple subsystems — likely 0 in Phase 1) and exits 0 OR 1 depending on whether any cross-cutting auto-inferred files exist.
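+
+When the output is trimmed through a pipe, the shell reports the exit status of the last command in the pipeline, which can hide the script's own status. One way to sidestep that is to trim in Python, as in the sketch below. The helper name `run_and_tail` and the stand-in command are hypothetical; in practice the command would be the `--audit --strict` invocation.
+
+```python
+import subprocess
+import sys
+
+def run_and_tail(cmd: list[str], tail: int = 5) -> int:
+ # Capture combined output so the trimming happens in Python, not in a pipe.
+ proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
+ for line in proc.stdout.splitlines()[-tail:]:
+  print(line)
+ # returncode is the command's own status, unaffected by the trimming.
+ return proc.returncode
+
+# Hypothetical stand-in: prints 8 lines, then exits 1 like a strict audit failure.
+demo = [sys.executable, "-c", "import sys\nfor i in range(8): print('line', i)\nsys.exit(1)"]
+code = run_and_tail(demo)
+print(f"exit: {code}")
+```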
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add scripts/run_tests_batched.py
+git commit -m "feat(run_tests_batched): add --plan and --audit modes (Phase 1 stub)"
+```
+
+---
+
+## Task 1.15: Manually verify --plan output matches the spec
+
+**Files:** none (manual verification only)
+
+- [ ] **Step 1: Verify the spec invariants from Section 3.3 of spec.md**
+
+Run:
+```bash
+python scripts/run_tests_batched.py --plan 2>&1 | grep -E "tier-(0|1|2|3|H|P)"
+```
+
+Expected: 
+- Exactly ONE `tier-3-live_gui` batch (containing all 14 `*_sim.py` files)
+- `tier-0-opt_in-clean_install` AND `tier-0-opt_in-docker` both SKIP (no env var)
+- `tier-1-unit-*` batches grouped by subsystem batch_group (core, gui, mma, comms, headless)
+- `tier-2-mock_app-*` batches (if any tests use mock_app)
+- `tier-H-headless` and `tier-P-performance` (if any tests match)
+
+- [ ] **Step 2: Confirm no live_gui test appears in a non-tier-3 batch**
+
+Run:
+```bash
+python scripts/run_tests_batched.py --plan 2>&1 | grep -E "_sim\.py" | grep -v "tier-3"
+```
+
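+Expected: NO output (zero matches). Every `*_sim.py` file is in the tier-3 batch only. Note that this check is vacuous unless the `--plan` output lists individual file names; the Task 1.14 stub prints only a per-batch file count, so if nothing prints the file list, confirm the invariant by inspecting `b.files` on the batches returned by `plan()`.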
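+
+The same invariant can also be checked directly on batch data rather than on printed text. The sketch below is a minimal illustration under stated assumptions: `PlannedBatch` is a hypothetical stand-in carrying only a `label` and a list of `files`, and the sample batches are invented; the real `Batch` type is defined in Task 1.10.
+
+```python
+from typing import NamedTuple
+
+class PlannedBatch(NamedTuple):
+ # Hypothetical stand-in for the plan's Batch; only the fields used here.
+ label: str
+ files: list[str]
+
+def misplaced_sim_files(batches: list[PlannedBatch]) -> list[tuple[str, str]]:
+ # Collect (label, file) pairs where a *_sim.py file sits outside tier-3.
+ bad = []
+ for b in batches:
+  if b.label.startswith("tier-3"):
+   continue
+  bad.extend((b.label, f) for f in b.files if f.endswith("_sim.py"))
+ return bad
+
+batches = [
+ PlannedBatch("tier-1-unit-core", ["tests/test_paths.py"]),
+ PlannedBatch("tier-3-live_gui", ["tests/test_gui_sim.py"]),
+ PlannedBatch("tier-2-mock_app-gui", ["tests/test_extra_sim.py"]),
+]
+violations = misplaced_sim_files(batches)
+print(violations)
+```
+
+An empty `violations` list corresponds to the "zero matches" expectation of the grep above.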
+
+- [ ] **Step 3: Document the verification in a short note for git history**
+
+If any invariant is violated, STOP and debug. Otherwise, proceed to commit. No commit is needed in this task; the verification is captured in the Phase 1 checkpoint git note.
+
+---
+
+## Task 1.16: Phase 1 checkpoint commit and git note
+
+**Files:** none (commit + note only)
+
+- [ ] **Step 1: Confirm all Phase 1 tests pass**
+
+Run: `uv run pytest tests/test_categorizer.py tests/test_batcher.py tests/test_pytest_collection_order.py -v`
+Expected: All 20 tests (13 + 5 + 2) PASS.
+
+- [ ] **Step 2: Confirm the full test suite still works (no regressions from plugin wiring)**
+
+Run: `uv run pytest tests/ --ignore=tests/test_audit_main_thread_imports.py -q --timeout=60 2>&1 | tail -15`
+Expected: existing test results match the pre-Phase-1 baseline (any pre-existing failures are unchanged; no NEW failures introduced).
+
+- [ ] **Step 3: Create the checkpoint commit (if there are uncommitted changes) and attach git note**
+
+```bash
+git add -A
+if ! git diff --cached --quiet; then git commit -m "conductor(checkpoint): Phase 1 complete - library + dry-run modes"; fi
+SHA=$(git log -1 --format="%H")
+git notes add -m "Phase 1 checkpoint: test_batching_refactor_20260606
+
+Library + dry-run complete:
+- scripts/test_categorizer.py: 13 tests, all auto-inference rules
+- scripts/test_batcher.py: 5 tests, deterministic plan() with 6 tiers
+- scripts/pytest_collection_order.py: 2 tests, opt-in per-test sort
+- scripts/run_tests_batched.py: --plan and --audit modes (no execution)
+- tests/conftest.py: plugin registered (no-op without entries)
+
+Manually verified: all 14 *_sim.py files are in ONE tier-3 batch.
+Opt-in tests SKIP cleanly without env var.
+
+Next: Phase 2 (shadow run via CI)." "$SHA"
+```
+
+- [ ] **Step 4: Mark phase_1 as completed in state.toml and record the checkpoint SHA**
+
+Edit `conductor/tracks/test_batching_refactor_20260606/state.toml` line:
+```toml
+phase_1 = { status = "in_progress", checkpoint_sha = "", name = "Library + dry-run modes" }
+```
+Change to:
+```toml
+phase_1 = { status = "completed", checkpoint_sha = "<first-7-of-SHA>", name = "Library + dry-run modes" }
+```
+
+Then:
+```bash
+git add conductor/tracks/test_batching_refactor_20260606/state.toml
+git commit -m "conductor(plan): mark Phase 1 complete in test_batching_refactor_20260606"
+```
+
+---
+
+# Phase 2: Shadow run
+
+> Goal: Run the new script in CI as a non-blocking informational job. Compare its pass/fail signature to the old script's. Investigate any divergence.
+
+---
+
+## Task 2.1: Add a CI workflow job for the shadow run
+
+**Files:**
+- Create: `.github/workflows/test_batching_shadow.yml` (or equivalent CI config file in this repo's CI location)
+
+- [ ] **Step 1: Identify the CI configuration location**
+
+Check the repo for existing CI files. Look for `.github/workflows/`, `.gitlab-ci.yml`, or similar. If a CI directory exists, follow its conventions.
+
+- [ ] **Step 2: Create a non-blocking job that runs `python scripts/run_tests_batched.py --plan` and uploads the output as an artifact**
+
+```yaml
+name: test-batching-shadow
+on: [push, pull_request]
+jobs:
+  shadow:
+    runs-on: ubuntu-latest
+    continue-on-error: true
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - run: pip install uv
+      - run: uv sync
+      - run: python scripts/run_tests_batched.py --plan > plan.txt 2>&1
+      - run: python scripts/run_tests_batched.py --audit > audit.txt 2>&1
+      - uses: actions/upload-artifact@v4
+        with:
+          name: test-batching-plan
+          path: |
+            plan.txt
+            audit.txt
+```
+
+Adjust the runner / step names to match the repo's conventions.
+
+- [ ] **Step 3: Commit and push to a feature branch**
+
+```bash
+git checkout -b test_batching_shadow_ci
+git add .github/workflows/test_batching_shadow.yml
+git commit -m "ci: add test batching shadow run (informational, non-blocking)"
+git push -u origin test_batching_shadow_ci
+```
+
+- [ ] **Step 4: Open a PR and observe 1+ CI runs**
+
+(Manual) Verify the shadow job runs and uploads artifacts. Compare `plan.txt` and `audit.txt` to manual expectations.
+
+---
+
+## Task 2.2: Investigate and fix any categorizer/batcher divergence
+
+**Files:** as needed (likely `scripts/test_categorizer.py` or `scripts/test_batcher.py`)
+
+- [ ] **Step 1: Compare the shadow job's `plan.txt` against the manually-verified plan from Task 1.15**
+
+If they match, skip to Task 2.3. If they diverge, identify the source.
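+
+For the comparison itself, a line-by-line diff is usually enough to locate the diverging batch. The sketch below uses the standard-library `difflib`; the function name `plan_diff`, the file labels, and the two sample plan texts are hypothetical and only mimic the shape of the `--plan` output.
+
+```python
+import difflib
+
+def plan_diff(expected: str, actual: str) -> list[str]:
+ # An empty result means the two plan outputs are identical line for line.
+ return list(difflib.unified_diff(
+  expected.splitlines(), actual.splitlines(),
+  fromfile="task-1.15", tofile="shadow-ci", lineterm="",
+ ))
+
+local = "[RUN] tier-1-unit-core: 42 files\n[RUN] tier-3-live_gui: 14 files"
+shadow = "[RUN] tier-1-unit-core: 41 files\n[RUN] tier-3-live_gui: 14 files"
+for line in plan_diff(local, shadow):
+ print(line)
+```
+
+If the estimated seconds in the plan depend on a developer-local durations cache, it may be worth stripping that field before diffing so that only structural differences show up.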
+
+- [ ] **Step 2: Add a regression test for the divergence**
+
+If a real bug was found, write a failing test that reproduces it BEFORE fixing.
+
+- [ ] **Step 3: Fix the bug**
+
+Follow TDD: red → green → commit.
+
+---
+
+## Task 2.3: Phase 2 checkpoint
+
+**Files:** none (commit + note only)
+
+- [ ] **Step 1: Confirm the shadow job has been green for at least 1 week of CI runs**
+
+(Manual) If the shadow job has not been green for 1 week, wait. If 1+ week has passed without divergence, proceed.
+
+- [ ] **Step 2: Create the checkpoint commit and git note**
+
+```bash
+git add -A
+if ! git diff --cached --quiet; then git commit -m "conductor(checkpoint): Phase 2 complete - shadow run validated"; fi
+SHA=$(git log -1 --format="%H")
+git notes add -m "Phase 2 checkpoint: shadow run validated. No divergence between new and old scripts over 1+ week of CI runs. Ready to switch default." "$SHA"
+```
+
+- [ ] **Step 3: Update state.toml phase_2 status**
+
+Edit `state.toml` phase_2 line: change status to `"completed"`, fill in `checkpoint_sha`.
+
+```bash
+git add conductor/tracks/test_batching_refactor_20260606/state.toml
+git commit -m "conductor(plan): mark Phase 2 complete in test_batching_refactor_20260606"
+```
+
+---
+
+# Phase 3: Switch default
+
+> Goal: Replace the old `run_tests_batched.py` with the new one (with full CLI including `--tiers`, `--include-opt-in`, `--durations` recording). Update `docs/guide_testing.md`. Keep old script as `.legacy` for one cycle.
+
+---
+
+## Task 3.1: Add --tiers, --durations, and full execution logic to run_tests_batched.py
+
+**Files:**
+- Modify: `scripts/run_tests_batched.py`
+
+- [ ] **Step 1: Extend the script to actually execute pytest per batch**
+
+Replace the entire contents of `scripts/run_tests_batched.py` with the full version that includes:
+- Tier parsing from `--tiers` (e.g., `--tiers 1,2,3`)
+- `--durations` flag to record `.test_durations.json`
+- Actual `subprocess.run(uv run pytest ...)` per batch
+- Summary table output (per Section 5 of spec.md)
+- Worst exit code returned
+
+The full implementation (~150 lines) follows the spec's Section 4.3:
+
+```python
+import argparse
+import json
+import subprocess
+import sys
+import time
+from pathlib import Path
+
+from scripts.test_categorizer import categorize_all
+from scripts.test_batcher import plan, Batch
+
+def _parse_tiers(s: str) -> set[str]:
+ return {t.strip() for t in s.split(",") if t.strip()}
+
+def _durations_path(tests_dir: Path) -> Path:
+ return tests_dir / ".test_durations.json"
+
+def _load_durations(p: Path) -> dict[str, float]:
+ if not p.exists():
+  return {}
+ try:
+  with p.open("r", encoding="utf-8") as f:
+   return json.load(f)
+ except (json.JSONDecodeError, OSError):
+  return {}
+
+def _save_durations(p: Path, durations: dict[str, float]) -> None:
+ tmp = p.with_suffix(".json.tmp")
+ with tmp.open("w", encoding="utf-8") as f:
+  json.dump(durations, f, indent=2, sort_keys=True)
+ tmp.replace(p)
+
+def _parse_durations_from_pytest_output(stdout: str) -> dict[str, float]:
+ out: dict[str, float] = {}
+ for line in stdout.splitlines():
+  parts = line.split(None, 2)
+  if len(parts) != 3:
+   continue
+  time_str, when, nodeid = parts
+  if when not in ("setup", "call", "teardown") or "::" not in nodeid:
+   continue
+  try:
+   seconds = float(time_str.rstrip("s"))
+  except ValueError:
+   continue
+  nodeid = nodeid.strip()
+  out[nodeid] = out.get(nodeid, 0.0) + seconds
+ return out
+
+def _run_batch(b: Batch, durations: dict[str, float]) -> tuple[int, float, dict[str, float]]:
+ if b.skip_reason:
+  return 0, 0.0, {}
+ cmd = ["uv", "run", "pytest", "-v", "--durations=0"] + b.pytest_args + [str(f) for f in b.files]
+ print(f"\n>>> Running {b.label} ({len(b.files)} files)")
+ t0 = time.monotonic()
+ proc = subprocess.run(cmd, capture_output=True, text=True)
+ elapsed = time.monotonic() - t0
+ new_durs = _parse_durations_from_pytest_output(proc.stdout)
+ print(proc.stdout[-2000:] if proc.returncode != 0 else f"<<< {b.label} PASS in {elapsed:.1f}s")
+ if proc.returncode != 0:
+  print(f"<<< {b.label} FAIL (exit {proc.returncode}) in {elapsed:.1f}s")
+  print(proc.stderr[-1000:])
+ return proc.returncode, elapsed, new_durs
+
+def _print_audit(records, strict: bool) -> int:
+ auto = [r for r in records if r.source == "auto"]
+ print(f"Auto-inferred (unclassified) records: {len(auto)}")
+ for r in auto:
+  print(f" {r.filename}: fc={r.fixture_class.value}, subs={r.subsystems}, bg={r.batch_group}")
+ if strict:
+  bad = [r for r in auto if len(r.subsystems) > 1]
+  if bad:
+   print(f"STRICT: {len(bad)} auto-inferred files have multiple subsystems (probably cross-cutting):")
+   for r in bad:
+    print(f" {r.filename}: subs={r.subsystems}")
+   return 1
+ return 0
+
+def _print_summary(results: list[tuple[Batch, int, float]]) -> int:
+ print("\n" + "=" * 60)
+ print("SUMMARY")
+ print("=" * 60)
+ worst = 0
+ for b, code, elapsed in results:
+  if b.skip_reason:
+   status = "SKIPPED"
+  elif code == 0:
+   status = "PASS"
+  else:
+   status = "FAIL"
+   worst = max(worst, code)
+  n = len(b.files)
+  print(f"[{b.tier}] {b.label:40s} {status:8s} {n} files {elapsed:6.1f}s")
+ return worst
+
+def main() -> int:
+ p = argparse.ArgumentParser()
+ p.add_argument("--tests-dir", default="tests")
+ p.add_argument("--registry", default="tests/test_categories.toml")
+ p.add_argument("--tiers", default="1,2,3,H")
+ p.add_argument("--include-opt-in", action="store_true")
+ p.add_argument("--no-xdist", action="store_true")
+ p.add_argument("--plan", action="store_true")
+ p.add_argument("--audit", action="store_true")
+ p.add_argument("--strict", action="store_true")
+ p.add_argument("--durations", action="store_true", help="Record per-test durations to .test_durations.json")
+ options = p.parse_args()
+ tiers = _parse_tiers(options.tiers)
+ tests_dir = Path(options.tests_dir)
+ durations_path = _durations_path(tests_dir)
+ durations = _load_durations(durations_path)
+ records = categorize_all(tests_dir, Path(options.registry))
+ if options.audit:
+  return _print_audit(records, strict=options.strict)
+ batches = plan(records, tiers=tiers, include_opt_in=options.include_opt_in, xdist=not options.no_xdist)
+ if options.plan:
+  for b in batches:
+   status = "SKIP" if b.skip_reason else "RUN"
+   print(f"[{status}] {b.label}: {len(b.files)} files, est {b.estimated_seconds:.1f}s")
+  return 0
+ results: list[tuple[Batch, int, float]] = []
+ merged_durations = dict(durations)
+ for b in batches:
+  code, elapsed, new_durs = _run_batch(b, merged_durations)
+  results.append((b, code, elapsed))
+  merged_durations.update(new_durs)
+ if options.durations:
+  _save_durations(durations_path, merged_durations)
+ return _print_summary(results)
+
+if __name__ == "__main__":
+ sys.exit(main())
+```
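+
+Since `--durations` recording depends on reading pytest's own report, it helps to see the line shape being parsed. In pytest's "slowest durations" section each line starts with the time, followed by the phase (`setup`, `call`, or `teardown`) and then the node id. The sketch below parses that shape and sums the phases per test; `parse_durations` and the sample report are hypothetical, written only to illustrate the format.
+
+```python
+PHASES = {"setup", "call", "teardown"}
+
+def parse_durations(report: str) -> dict[str, float]:
+ # Sum setup, call and teardown time per node id.
+ totals: dict[str, float] = {}
+ for line in report.splitlines():
+  parts = line.split(None, 2)
+  if len(parts) != 3 or parts[1] not in PHASES or not parts[0].endswith("s"):
+   continue
+  try:
+   seconds = float(parts[0][:-1])
+  except ValueError:
+   continue
+  nodeid = parts[2].strip()
+  totals[nodeid] = totals.get(nodeid, 0.0) + seconds
+ return totals
+
+sample = """\
+============ slowest durations ============
+0.30s call     tests/test_a.py::test_slow
+0.05s setup    tests/test_a.py::test_slow
+0.10s call     tests/test_b.py::test_fast
+tests/test_a.py::test_slow PASSED [ 50%]
+"""
+parsed = parse_durations(sample)
+print(parsed)
+```
+
+Lines from the verbose progress output (such as the `PASSED` line in the sample) are ignored because their second field is not a phase name.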
+
+- [ ] **Step 2: Run --plan to confirm structure**
+
+Run: `python scripts/run_tests_batched.py --plan --tiers 1,2,3 2>&1 | head -10`
+Expected: prints tier-1, tier-2, tier-3 batches (no execution).
+
+- [ ] **Step 3: Run --tiers 1 to test the actual execution path on a small batch**
+
+Run: `python scripts/run_tests_batched.py --tiers 1 2>&1 | tail -20`
+Expected: runs all unit-tier batches and prints a SUMMARY table. Some tests may fail; that's OK (the script's exit code reflects test pass/fail, not the implementation).
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add scripts/run_tests_batched.py
+git commit -m "feat(run_tests_batched): full CLI with --tiers, --durations, actual pytest execution"
+```
+
+---
+
+## Task 3.2: Rename the old script to .legacy
+
+**Files:**
+- Rename: `scripts/run_tests_batched.py` (the old, pre-Phase-1 version) → `scripts/run_tests_batched.py.legacy`
+
+- [ ] **Step 1: Recover the old script from git history**
+
+The old 36-line version was committed at SHA `b7a97374^` (or earlier; check `git log --all --oneline -- scripts/run_tests_batched.py | head -5`). Recover it:
+
+```bash
+git log --oneline --all -- scripts/run_tests_batched.py | head -5
+git show <old_sha>:scripts/run_tests_batched.py > scripts/run_tests_batched.py.legacy
+```
+
+- [ ] **Step 2: Verify the .legacy file matches the original 36-line version**
+
+Run: `Get-Content scripts/run_tests_batched.py.legacy | Measure-Object -Line` (should be 36 lines).
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add scripts/run_tests_batched.py.legacy
+git commit -m "chore: preserve old run_tests_batched.py as .legacy for one cycle"
+```
+
+---
+
+## Task 3.3: Update docs/guide_testing.md
+
+**Files:**
+- Modify: `docs/guide_testing.md` (Section: "Running Tests")
+
+- [ ] **Step 1: Read the current "Running Tests" section**
+
+Use `manual-slop_get_file_slice path=docs/guide_testing.md start_line=388 end_line=448` (or grep for "## Running Tests" to find the exact range).
+
+- [ ] **Step 2: Add a "Batched Run" subsection that references the new script, and update the "All Tests" / "Specific Test File" subsections to point at it**
+
+Insert the new subsection before "By Marker":
+
+````markdown
+### Batched Run (Default for Local Development)
+
+The default for local development is the new categorized batcher:
+
+```bash
+python scripts/run_tests_batched.py
+```
+
+This runs 6 fixture-class-isolated tiers: opt-in (skipped unless `--include-opt-in`), unit (with pytest-xdist), mock_app, live_gui (one session), headless, performance. Each tier prints a summary line. Use `--plan` to see the batch plan without running; `--audit` to list unclassified files; `--tiers 1,2` to limit which tiers run.
+
+See `conductor/tracks/test_batching_refactor_20260606/spec.md` for the full design.
+````
+
+- [ ] **Step 3: Verify the docs render correctly**
+
+Run: `grep -A 2 "Batched Run" docs/guide_testing.md | head -10`
+Expected: shows the new section.
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add docs/guide_testing.md
+git commit -m "docs(testing): document new run_tests_batched.py in Running Tests section"
+```
+
+---
+
+## Task 3.4: Phase 3 checkpoint
+
+**Files:** none (commit + note only)
+
+- [ ] **Step 1: Run the full new script and confirm the existing 273+ test suite still passes (modulo any pre-existing failures)**
+
+Run: `python scripts/run_tests_batched.py --tiers 1,2 2>&1 | tail -15`
+Expected: SUMMARY table with all tier-1 and tier-2 batches either PASS or have the same pre-existing failures as before.
+
+- [ ] **Step 2: Run live_gui tier separately**
+
+Run: `python scripts/run_tests_batched.py --tiers 3 2>&1 | tail -10`
+Expected: tier-3-live_gui batch runs all `*_sim.py` files; some may fail (pre-existing); no NEW failures.
+
+- [ ] **Step 3: Create the checkpoint commit and git note**
+
+```bash
+git add -A
+if ! git diff --cached --quiet; then git commit -m "conductor(checkpoint): Phase 3 complete - new script is default"; fi
+SHA=$(git log -1 --format="%H")
+git notes add -m "Phase 3 checkpoint: new run_tests_batched.py is the default. Old script preserved as .legacy. docs/guide_testing.md updated. Ready for Phase 4 cleanup." "$SHA"
+```
+
+- [ ] **Step 4: Update state.toml phase_3 status**
+
+Edit `state.toml` phase_3 line: change status to `"completed"`, fill in `checkpoint_sha`.
+
+```bash
+git add conductor/tracks/test_batching_refactor_20260606/state.toml
+git commit -m "conductor(plan): mark Phase 3 complete in test_batching_refactor_20260606"
+```
+
+---
+
+# Phase 4: Cleanup
+
+> Goal: Populate the registry with the ~30 cross-cutting / ambiguous files identified during the audit. Delete the legacy script. Add `.test_durations.json` to `.gitignore`. Archive the track.
+
+---
+
+## Task 4.1: Run --audit on a clean clone and collect cross-cutting files
+
+**Files:**
+- Create: `docs/test_categorization_audit_20260606.md` (intermediate; can be deleted or kept)
+
+- [ ] **Step 1: Run --audit --strict and capture output**
+
+Run:
+```bash
+python scripts/run_tests_batched.py --audit --strict > docs/test_categorization_audit_20260606.md 2>&1
+echo "exit: $?" >> docs/test_categorization_audit_20260606.md
+```
+
+- [ ] **Step 2: Review the audit output and identify ~30 cross-cutting / ambiguous files**
+
+Open the file and find:
+- Auto-inferred files with multiple subsystems (cross-cutting)
+- Auto-inferred files with empty subsystems (filename doesn't start with a known prefix; e.g., `test_z_negative_flows.py`, `test_subagent_summarization.py`)
+- Files where the auto-inferred batch_group feels wrong
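+
+The first two categories in the list above can be pulled out of the audit report mechanically; the third still needs human judgment. The sketch below is a rough illustration that assumes the per-file line format printed by `_print_audit` and assumes `subsystems` is rendered as a Python list or tuple literal. The `triage` helper, the regex, and the sample text are all hypothetical.
+
+```python
+import ast
+import re
+
+# Assumed line shape: " <file>: fc=<class>, subs=[...], bg=<group>"
+LINE = re.compile(r"^\s*(?P<name>\S+): fc=[^,]+, subs=(?P<subs>\[.*?\]|\(.*?\)), bg=")
+
+def triage(audit_text: str) -> dict[str, list[str]]:
+ buckets: dict[str, list[str]] = {"cross_cutting": [], "no_subsystem": []}
+ for line in audit_text.splitlines():
+  m = LINE.match(line)
+  if not m:
+   continue
+  # literal_eval safely turns the printed list back into a Python value.
+  subs = ast.literal_eval(m.group("subs"))
+  if len(subs) > 1:
+   buckets["cross_cutting"].append(m.group("name"))
+  elif len(subs) == 0:
+   buckets["no_subsystem"].append(m.group("name"))
+ return buckets
+
+sample = """\
+Auto-inferred (unclassified) records: 3
+ test_gui_dag_beads.py: fc=live_gui, subs=['gui', 'dag', 'beads'], bg=gui
+ test_z_negative_flows.py: fc=unit, subs=[], bg=core
+ test_paths.py: fc=unit, subs=['paths'], bg=core
+"""
+buckets = triage(sample)
+print(buckets)
+```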
+
+- [ ] **Step 3: Commit the audit report**
+
+```bash
+git add docs/test_categorization_audit_20260606.md
+git commit -m "conductor(audit): capture Phase 4 cross-cutting file audit"
+```
+
+---
+
+## Task 4.2: Populate tests/test_categories.toml with cross-cutting entries
+
+**Files:**
+- Create: `tests/test_categories.toml`
+
+- [ ] **Step 1: Create the registry file with the ~30 cross-cutting entries**
+
+```toml
+# Hand-curated registry for cross-cutting and ambiguous tests.
+# Auto-inferred records that are correct do NOT need entries here.
+# Generated 2026-06-06 from docs/test_categorization_audit_20260606.md.
+
+[files.test_gui_dag_beads]
+fixture_class = "live_gui"
+subsystems = ["gui", "dag", "beads"]
+batch_group = "gui"
+notes = "Cross-cutting: drives GUI, asserts on DAG state, exercises Beads backend"
+
+[files.test_arch_boundary_phase1]
+subsystems = ["architecture"]
+batch_group = "mma"
+notes = "Phase 1 of arch-boundary refactor; subsystem ambiguous (arch/phase), group = mma"
+
+[files.test_arch_boundary_phase2]
+subsystems = ["architecture"]
+batch_group = "mma"
+
+[files.test_arch_boundary_phase3]
+subsystems = ["architecture"]
+batch_group = "mma"
+
+[files.test_z_negative_flows]
+subsystems = ["misc"]
+batch_group = "core"
+notes = "Filename prefix 'z' is not a known subsystem; auto-inference fails"
+
+[files.test_subagent_summarization]
+subsystems = ["mma", "tiered"]
+batch_group = "mma"
+
+[files.test_tiered_aggregation]
+subsystems = ["mma", "tiered"]
+batch_group = "mma"
+
+[files.test_tiered_context]
+subsystems = ["mma", "tiered"]
+batch_group = "mma"
+
+[files.test_tier4_interceptor]
+subsystems = ["tier4", "mma"]
+batch_group = "mma"
+
+[files.test_tier4_patch_generation]
+subsystems = ["tier4", "mma"]
+batch_group = "mma"
+
+# Continue with ~20 more entries identified in the audit...
+# (The engineer filling this in should add entries for all files flagged
+# by --audit --strict in Task 4.1)
+```
+
+(The full ~30 entries are filled in by the implementer based on the audit output. The pattern above shows the schema.)
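+
+A hand-edited registry is easy to get subtly wrong (a misspelled key, a string where a list is expected), so a quick sanity check before committing may help. The sketch below is an assumption-laden illustration, not part of the planned scripts: `registry_problems` is a hypothetical helper, the allowed key set is inferred from the keys that appear in this plan, and it treats `subsystems` and `batch_group` as required because every entry shown above has them; the spec may allow partial entries. It needs Python 3.11+ for `tomllib`.
+
+```python
+import tomllib
+
+# Keys seen in this plan's registry examples; assumed, not taken from the spec.
+ALLOWED_KEYS = {"fixture_class", "subsystems", "batch_group", "notes", "test_order"}
+
+def registry_problems(text: str) -> list[str]:
+ problems = []
+ for name, entry in tomllib.loads(text).get("files", {}).items():
+  unknown = sorted(set(entry) - ALLOWED_KEYS)
+  if unknown:
+   problems.append(f"{name}: unknown keys {unknown}")
+  subs = entry.get("subsystems")
+  if not isinstance(subs, list) or not all(isinstance(s, str) for s in subs):
+   problems.append(f"{name}: subsystems must be a list of strings")
+  if not isinstance(entry.get("batch_group"), str):
+   problems.append(f"{name}: batch_group must be a string")
+ return problems
+
+good = '[files.test_z_negative_flows]\nsubsystems = ["misc"]\nbatch_group = "core"\n'
+bad = '[files.test_broken]\nsubsystems = "mma"\nbatchgroup = "mma"\n'
+for problem in registry_problems(good + "\n" + bad):
+ print(problem)
+```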
+
+- [ ] **Step 2: Re-run --audit; expect fewer auto-inferred records and zero strict violations**
+
+Run (bash): `python scripts/run_tests_batched.py --audit --strict 2>&1 | tail -5; echo "exit: ${PIPESTATUS[0]}"` (plain `$?` would report the exit status of `tail`).
+Expected: exit 0; STRICT violations are zero (or significantly reduced).
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add tests/test_categories.toml
+git commit -m "feat(tests): populate test_categories.toml with cross-cutting entries"
+```
+
+---
+
+## Task 4.3: Add .test_durations.json to .gitignore
+
+**Files:**
+- Modify: `.gitignore`
+
+- [ ] **Step 1: Read .gitignore and find a sensible place to add the entry**
+
+Run: `Get-Content .gitignore | Select-String -Pattern "tests" -SimpleMatch` to see existing test-related entries.
+
+- [ ] **Step 2: Append the new entry**
+
+Add (at the end, or grouped with other test artifact ignores):
+
+```
+# Local test duration cache (developer-local; regenerated on each batched run)
+tests/.test_durations.json
+```
+
+- [ ] **Step 3: Verify the file is now ignored**
+
+Run: `git check-ignore -v tests/.test_durations.json`
+Expected: prints `.gitignore:<line> tests/.test_durations.json` confirming the ignore.
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add .gitignore
+git commit -m "chore: gitignore tests/.test_durations.json (developer-local cache)"
+```
+
+---
+
+## Task 4.4: Delete the legacy script
+
+**Files:**
+- Delete: `scripts/run_tests_batched.py.legacy`
+
+- [ ] **Step 1: Confirm no remaining references to the legacy script**
+
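+Run: `rg -F "run_tests_batched.py.legacy" .` (or grep recursively for the fixed string)
+Expected: no matches outside this track's own plan and notes, which mention the file by name (the legacy file should be unused by code, CI, and docs).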
+
+- [ ] **Step 2: Delete the file**
+
+Run: `Remove-Item scripts/run_tests_batched.py.legacy`
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add -A
+git commit -m "chore: delete legacy run_tests_batched.py (was preserved for one cycle)"
+```
+
+---
+
+## Task 4.5: Archive the track
+
+**Files:**
+- Move: `conductor/tracks/test_batching_refactor_20260606/` → `conductor/tracks/archive/test_batching_refactor_20260606/` (or the archive convention used by this repo)
+
+- [ ] **Step 1: Check the archive convention used in this repo**
+
+Run: `Get-ChildItem conductor/tracks/archive_completed_tracks_20260603 | Select-Object Name -First 3`
+Or: `Get-ChildItem conductor/tracks -Filter "*archive*" -Directory | Select-Object Name`
+
+Use the convention the project uses (likely a single archive directory or per-date directories).
+
+- [ ] **Step 2: Move the track directory**
+
+```bash
+git mv conductor/tracks/test_batching_refactor_20260606 conductor/tracks/archive/test_batching_refactor_20260606
+```
+
+(Adjust the destination path per Step 1.)
+
+- [ ] **Step 3: Update conductor/tracks.md to move the entry to "Recently Completed" or the equivalent section**
+
+Edit `conductor/tracks.md`: move the line for `test_batching_refactor_20260606` from the "Remaining Backlog" section to "Recently Completed Tracks (2026-06+)" or similar. Change status from `[~]` (in progress) to `[x]` (completed).
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add conductor/tracks.md
+git commit -m "conductor(archive): ship test_batching_refactor_20260606 to archive"
+```
+
+---
+
+## Task 4.6: Phase 4 checkpoint (final)
+
+**Files:** none (commit + note only)
+
+- [ ] **Step 1: Run the full new script and confirm zero STRICT audit violations and all tests still pass**
+
+Run:
+```bash
+python scripts/run_tests_batched.py --audit --strict; echo "audit exit: $?"
+python scripts/run_tests_batched.py --tiers 1,2 2>&1 | tail -10
+```
+
+Expected: audit exits 0; tier-1 and tier-2 batches PASS or have only the same pre-existing failures as the baseline.
+
+- [ ] **Step 2: Create the final checkpoint commit and git note**
+
+```bash
+git add -A
+if ! git diff --cached --quiet; then git commit -m "conductor(checkpoint): Phase 4 complete - track shipped to archive"; fi
+SHA=$(git log -1 --format="%H")
+git notes add -m "Phase 4 checkpoint (TRACK COMPLETE): test_batching_refactor_20260606
+
+- tests/test_categories.toml populated with ~30 cross-cutting entries
+- .test_durations.json added to .gitignore
+- scripts/run_tests_batched.py.legacy deleted
+- Track archived to conductor/tracks/archive/
+
+Final state: scripts/run_tests_batched.py is the new default; --plan/--audit
+modes work; --tiers filters by tier; --include-opt-in gates opt-in tests;
+--durations records developer-local timing cache. live_gui tests in ONE
+pytest invocation (15s startup amortized). pytest-xdist for unit tier.
+
+No regressions in 273+ existing tests." "$SHA"
+```
+
+- [ ] **Step 3: Update state.toml final status**
+
+Edit `state.toml`:
+- `current_phase = 4` (or remove this field)
+- All phase_N entries: `status = "completed"`, `checkpoint_sha` filled in
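+- Add a final note at the bottom: `# Track completed <completion-date> and archived.` (use the actual completion date; Phase 2 alone requires at least one week of CI runs, so it will be later than 2026-06-06)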
+
+```bash
+git add conductor/tracks/archive/test_batching_refactor_20260606/state.toml
+git commit -m "conductor(plan): mark Phase 4 complete in test_batching_refactor_20260606"
+```
+
+---
+
+# Self-Review
+
+**1. Spec coverage:**
+- Section 1 (Problem Statement) — addressed by Task 1.14 / 3.1 (replace alphabetical with categorized).
+- Section 2 (Goals) — B (process isolation): Task 1.10 / 3.1 (one batch per tier). A (subsystem grouping): Task 1.5 (subsystem inference) + 1.10 (batch_group). C (xdist + session reuse): Task 1.10 + 3.1.
+- Section 3 (Architecture) — 6-tier model: Task 1.10 (`plan()`). Registry: Task 1.7. Auto-inference: Tasks 1.3-1.5.
+- Section 4 (Components) — categorizer: Tasks 1.1-1.7. batcher: Task 1.10. CLI orchestrator: Tasks 1.14 + 3.1. plugin: Tasks 1.11-1.13.
+- Section 5 (Output) — implemented in Task 3.1 (`_print_summary`).
+- Section 6 (CLI) — `--tiers`, `--include-opt-in`, `--plan`, `--audit`, `--strict`, `--no-xdist` all wired in Tasks 1.14 + 3.1.
+- Section 7 (Config) — `pyproject.toml` markers (already correct; no change). `.test_durations.json`: Task 4.3.
+- Section 8 (Migration) — all 4 phases mapped to plan phases.
+- Section 9 (Risks) — auto-inference misclassification: covered by `--audit --strict` (Task 1.14). Tier-3 crash: `--maxfail=1` (Task 1.10). xdist non-determinism: noted in Task 3.1's verification. New tests unclassified: `--audit` (Task 1.14).
+- Section 10 (Open Questions) — Q1: registry in `tests/` (Task 4.2). Q2: batch_group inferred by default (Task 1.5). Q3: deferred. Q4: per-run by default (Task 3.1).
+
+**2. Placeholder scan:** No "TBD", "TODO", "implement later". One intentional placeholder in Task 4.2 ("Continue with ~20 more entries identified in the audit...") with a clear pattern to follow.
+
+**3. Type consistency:** `FixtureClass`, `Speed`, `CategoryRecord` defined in Task 1.1; used in all subsequent categorizer tests and in `plan()` (Task 1.10). `Batch` defined in Task 1.10; used in Task 3.1's `_run_batch` and `_print_summary`. `sort_items_by_order` signature stable across Tasks 1.11 and 1.12.
+
+No issues found. Plan ready for execution.