manual_slop/docs/reports/PLANNING_DIGEST_20260606.md
ed 0f74705d01 docs(reports): add planning digest covering 5 tracks from 2026-06-06 session
Single-session planning digest that captures:
- The 5 tracks fully specced + planned (test_batching, qwen_llama_grok,
  data_oriented_error_handling, data_structure_strengthening,
  mcp_architecture_refactor)
- Cross-cutting design themes (data-oriented, audit-driven, per-track
  commit + git note, out-of-scope-by-default)
- The audit + data foundation (scripts/audit_weak_types.py; 430 -> 60
  finding; 0 strong patterns; 26 unique type strings; 86% concentrated
  in 6 files)
- The dependency graph + recommended execution order
- Follow-up tracks already planned in spec §12.1 of each track
- Recommended future tracks (post-tracks documentation is the top pick)
- Risks, open questions, and a complete file index

This is the kind of reference document that:
- Future planners consult to understand the codebase's current state
- The implementing agent uses to coordinate across tracks
- The user reviews as a digest of the planning work

Written in the project's docs/reports/ directory alongside the existing
Phase 5 reports (PHASE5_STABILISATION_REPORT.md, MUTATION_MATRIX_PHASE5.md, etc.).
2026-06-06 20:56:12 -04:00


Planning Digest: 5-Track Architectural Refactor (2026-06-06)

Status: Planning complete; implementation in flight
Author: Tier 2 Tech Lead (brainstorming + spec + plan for all 5 tracks)
Date: 2026-06-06
Audience: Future planners, the implementing agent, the user (as a reference / digest)


1. Executive Summary

In a single planning session, 5 architectural refactor tracks were specced and planned end-to-end. Together they reshape the manual_slop codebase around three foundational design principles — data-oriented error handling (Fleury), data-oriented types (named, documented, generated), and modular MCP architecture (sub-MCPs by category). All 5 tracks share a common ancestor in the startup_speedup_20260606 track (already shipped as of 12cec6ae), which established the lazy-SDK-import convention the other tracks depend on.

| # | Track | Status | Phases | Key new files | What it does |
|---|-------|--------|--------|---------------|--------------|
| 1 | test_batching_refactor_20260606 | Planned | 4 | scripts/{test_categorizer,test_batcher,pytest_collection_order}.py | Replaces alphabetical 4-at-a-time batching with tiered batching (Tier 1 unit + xdist, Tier 3 live_gui in one session, etc.) |
| 2 | qwen_llama_grok_integration_20260606 | Planned | 6 | src/{vendor_capabilities,openai_compatible,qwen_adapter}.py | Adds Qwen (DashScope), Llama (Ollama + OpenRouter + custom URL), Grok (xAI). Introduces the Vendor Capability Matrix. |
| 3 | data_oriented_error_handling_20260606 | Planned | 5 | src/result_types.py | Introduces Result[T], ErrorInfo, NilPath per Fleury. Removes ProviderError exception. Marks send() @deprecated; adds send_result(). |
| 4 | data_structure_strengthening_20260606 | Planned | 2 | src/type_aliases.py, scripts/generate_type_registry.py | Introduces 10 TypeAlias for the 430 anonymous dict[str, Any] / list[dict[...]] sites. Adds auto-generated docs/type_registry/. |
| 5 | mcp_architecture_refactor_20260606 | Planned | 7 | src/mcp_<type>.py (7 files), src/mcp_client_security.py | Splits 2,205-line mcp_client.py into slim controller + 6 native sub-MCPs + 1 external sub-MCP. |

Combined impact: ~5 new framework files; ~6 modified framework files; ~6 modified high-traffic files (for the type-aliases refactor); 1 monolithic file split into 9 focused files; 1 new CI gate script; 1 new docs directory.


2. Session Context

2.1 Workflow model

The user is operating in a planning / execution split mode:

  • This session: Tier 2 Tech Lead (me) does brainstorming → spec → plan for each track. No code is written or executed.
  • External session: Another agent does the implementation. It picks up each plan.md and executes task-by-task via the project's MMA tier system.

This split lets the user think strategically (planning) while the heavy lifting (executing) happens in parallel.

2.2 The pre-existing baseline

Before this session, the project had:

  • 277 test files in tests/ (test_*.py + *_sim.py)
  • 53 src files (src/*.py)
  • 14 deep-dive guides (docs/guide_*.md)
  • The startup_speedup_20260606 track was in flight (Phase 6 complete per 253e1798; track SHIPPED per 12cec6ae in the same window as this planning session)
  • The test_batching_refactor_20260606 track had been planned (spec + plan were in the folder but execution hadn't started)
  • Conductor convention was in place — every track has spec.md + metadata.json + state.toml; the tracks.md registry lists all tracks with their [track-created: <sha>] references

2.3 What changed during this session

The user asked for 5 different refactor specs in sequence:

  1. Test batching refactor — already-planned track; I reviewed and committed
  2. Qwen/Llama/Grok vendors + capability matrix — new spec; multiple design questions resolved
  3. Data-oriented error handling (Fleury pattern) — new spec; user brought the article + friend's notes
  4. Data structure strengthening (type aliases + named tuples) — new spec; user proposed auto-generated docs over TypedDict migration
  5. MCP architecture refactor (sub-MCPs) — new spec; user proposed mcp_<type>.py naming + the DSL future idea

For each, I followed the brainstorming → spec → plan flow per the user's stated preference.


3. Cross-Cutting Design Themes

Five design themes run through all the tracks. Understanding them makes each track's individual decisions coherent.

3.1 Data-Oriented Design (Fleury / Acton / Lottes)

The user explicitly references this in two of the five tracks (data_oriented_error_handling_20260606 for errors; mcp_architecture_refactor_20260606 for module boundaries). The framing is:

  • Errors are just cases, not special control-flow primitives. Use Result[T] with side-channel error lists, not exceptions.
  • Algorithms on data, not methods on objects. The MCPController is a data structure; sub-MCPs are data; the dispatch is a function from data to data.
  • Stable names, not types. Type aliases (Metadata, FileItem, etc.) name data roles; they don't enforce structure (that's deferred to TypedDict if ever).
  • Shared code where possible; unique code only where vendor-specific. The _send_<vendor>_result() functions in ai_client.py are thin boundary adapters; the send_openai_compatible() helper is the shared algorithm.

3.2 Capability / Pattern / Convention as first-class docs

The user values explicit, discoverable conventions over implicit understanding. Each track introduces at least one canonical document:

  • conductor/code_styleguides/error_handling.md (Fleury patterns)
  • conductor/code_styleguides/type_aliases.md (type alias conventions)
  • docs/type_registry/ (auto-generated per-source-file schema docs)
  • conductor/code_styleguides/mcp_<type>.py (implicit, via the naming convention)

The product-guidelines.md is the umbrella; the styleguides are the detailed references. This pattern should be followed for any future track that introduces a new convention.

3.3 Audit + data-driven decisions

Two of the five tracks are data-grounded:

  • test_batching_refactor_20260606: addressed the actual problem (alphabetical 4-at-a-time batching) and explicitly designed the solution around the test categories the project already uses (Tier 1 unit, Tier 2 mock_app, Tier 3 live_gui, etc.).
  • data_structure_strengthening_20260606: driven by the scripts/audit_weak_types.py findings (430 weak sites; 86% concentrated in 6 high-traffic files; 0 strong patterns; 26 unique type strings; top 4 = 86% of findings).

The audit data is the source of truth. The track's success criterion is a measurable drop in the audit count (430 → ~60 = 86% reduction).

3.4 Process: per-track commit + git note + checkpoint

Every plan follows the same template:

  • Per-task commit: 1 commit per Red-Green-Refactor step
  • Per-checkpoint git note: git notes add -m "..." summarizing what the phase delivered
  • Per-checkpoint state.toml update: current_phase advanced; checkpointsha filled in

This is a feature of the project's conductor/workflow.md and is consistently applied. The next planner / implementer should follow the same template.

3.5 Out-of-scope-by-default; follow-up tracks for the next round

Each of the 5 tracks explicitly defers work to follow-up tracks. The follow-ups are documented in each spec's §12.1:

  • public_api_migration_20260606 — removes deprecated send() (from data_oriented_error_handling)
  • type_registry_ci_20260606 — wires generate_type_registry.py --check into CI (from data_structure_strengthening)
  • mcp_dsl_20260606 — per-MCP compact DSL for tool calls (from mcp_architecture_refactor)
  • typed_dict_migration_20260606 — convert most-used aliases to TypedDict (initially planned; later replaced by the docs approach; kept as a future option)

These follow-ups are listed in conductor/tracks.md as [ ] placeholders (item 0f etc.). They should be sequenced AFTER the 5 main tracks ship.


4. The 5 Tracks in Detail

4.1 test_batching_refactor_20260606

Goal: Replace alphabetical 4-at-a-time batching with tiered batching that respects fixture-class boundaries.

Architecture:

  • scripts/test_categorizer.py: AST-based classifier that determines each test file's FixtureClass (UNIT, MOCK_APP, LIVE_GUI, HEADLESS, OPT_IN, PERFORMANCE) and its batch_group (e.g., core, gui, mma).
  • scripts/test_batcher.py: Pure scheduler. plan(records, options) -> list[Batch] deterministically produces batches.
  • scripts/pytest_collection_order.py: Conftest-loaded plugin for the per-test order control (opt-in per file).
  • scripts/run_tests_batched.py: Modified CLI orchestrator with --tiers, --include-opt-in, --plan, --audit modes.
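
The "pure scheduler" idea can be sketched as follows. This is a minimal illustration, not the planned implementation: the FileRecord and Batch shapes are hypothetical, the options argument of plan(records, options) is omitted, and the grouping rule (all live_gui files in one batch, everything else grouped by batch_group) is an assumption based on the decisions listed in this section.

```python
from dataclasses import dataclass
from enum import Enum


class FixtureClass(Enum):
    UNIT = "unit"
    MOCK_APP = "mock_app"
    LIVE_GUI = "live_gui"


@dataclass(frozen=True)
class FileRecord:  # hypothetical record shape produced by the categorizer
    path: str
    fixture_class: FixtureClass
    batch_group: str


@dataclass(frozen=True)
class Batch:  # hypothetical batch shape
    fixture_class: FixtureClass
    group: str
    paths: tuple[str, ...]


def plan(records: list[FileRecord]) -> list[Batch]:
    """Pure function: the same records always yield the same batches."""
    buckets: dict[tuple[str, str], list[str]] = {}
    for rec in records:
        # Assumption: every live_gui file shares one batch so the GUI starts once.
        group = "all" if rec.fixture_class is FixtureClass.LIVE_GUI else rec.batch_group
        buckets.setdefault((rec.fixture_class.value, group), []).append(rec.path)
    # Sorting keys and paths makes the output independent of input order.
    return [
        Batch(FixtureClass(cls), group, tuple(sorted(paths)))
        for (cls, group), paths in sorted(buckets.items())
    ]
```

Because plan() touches no files and spawns no processes, it can be unit-tested with plain in-memory records.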

Key decisions:

  • Tier 3 (live_gui) is one pytest invocation, not many. This is THE single biggest runtime savings (15s startup amortized).
  • Tier 1 (unit) uses pytest-xdist for parallelism.
  • Tier 0 (opt-in) is gated on BOTH env var AND CLI flag (defense-in-depth: setting the env var alone shouldn't accidentally enable docker tests).
  • Hybrid classification: auto-infer from filename + AST fixture scan; hand-curated tests/test_categories.toml overrides for cross-cutting and ambiguous files.
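
The Tier 0 double gate is easy to get subtly wrong (OR instead of AND), so here is a small sketch of the intended check. The environment variable name and the "1" convention are hypothetical; only the "both must be set" rule comes from the plan.

```python
import os
from typing import Mapping, Optional

OPT_IN_ENV_VAR = "MANUAL_SLOP_OPT_IN_TESTS"  # hypothetical variable name


def opt_in_enabled(cli_flag: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Tier 0 runs only when the CLI flag AND the env var are both set."""
    env = os.environ if env is None else env
    return cli_flag and env.get(OPT_IN_ENV_VAR) == "1"
```

Passing env explicitly keeps the check testable without mutating the real process environment.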

What's NOT done: The script does NOT modify test files or fixtures; it only categorizes and batches. New tests get sensible defaults automatically.

Current state: Plan complete (7fdab705 spec, f7b11f7f plan). Ready for execution.


4.2 qwen_llama_grok_integration_20260606

Goal: Add first-class support for Qwen, Llama, Grok. Introduce the Vendor Capability Matrix.

Architecture:

  • src/vendor_capabilities.py: VendorCapabilities dataclass, _REGISTRY populated per-(vendor, model).
  • src/openai_compatible.py: shared send_openai_compatible() helper (data-oriented design — operates on normalized data).
  • src/qwen_adapter.py: DashScope-specific tool format translation + error classification.
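
A rough sketch of what a per-(vendor, model) capability registry could look like. The field names follow the 7 capabilities listed under "Key decisions"; the field types, the get_capabilities() helper, the all-off default, and the registry entry are illustrative assumptions, not the planned API or real model data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VendorCapabilities:
    vision: bool = False
    tool_calling: bool = False
    caching: bool = False
    streaming: bool = False
    model_discovery: bool = False
    context_window: int = 0  # assumed to be a token count
    cost_tracking: bool = False


# Keyed per (vendor, model). The entry below is a made-up placeholder.
_REGISTRY: dict[tuple[str, str], VendorCapabilities] = {
    ("example_vendor", "example-model"): VendorCapabilities(
        vision=True, tool_calling=True, streaming=True, context_window=32_000
    ),
}

_NO_CAPABILITIES = VendorCapabilities()


def get_capabilities(vendor: str, model: str) -> VendorCapabilities:
    """Unknown pairs get the all-off default, so callers never branch on None."""
    return _REGISTRY.get((vendor, model), _NO_CAPABILITIES)
```

A UI element would then read a single field, for example showing the screenshot button only when get_capabilities(vendor, model).vision is true.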

Key decisions:

  • Naming convention: _send_<vendor>_result() returning Result[str, ErrorInfo] (8 vendors: Gemini, Anthropic, DeepSeek, MiniMax, Gemini CLI, Qwen, Llama, Grok).
  • Capability Matrix v1: 7 capabilities — vision, tool_calling, caching, streaming, model_discovery, context_window, cost_tracking. Audio and server-side code_execution deferred to a future track.
  • UX adaptation: 9 UI elements read the matrix (including the screenshot button, tools toggle, cache panel, stream progress, fetch models button, token budget max, and cost panel).
  • The OpenAI-compatible SDK keeps raising at the SDK boundary; the new _send_<vendor>_result() functions catch those exceptions and convert them to ErrorInfo. Per Fleury: "exceptions are reserved for the SDK boundary."

Coordination with startup_speedup_20260606: Qwen's DashScope SDK adds a new import; the audit script scripts/audit_main_thread_imports.py ensures the import is gated to a worker thread, not the main thread. Verified at the baseline in Phase 1 of the track.

Current state: Plan complete (b17cbbde plan). Ready for execution.


4.3 data_oriented_error_handling_20260606

Goal: Introduce Ryan Fleury's "errors are just cases" framework as a project convention.

Architecture:

  • src/result_types.py: ErrorKind enum, ErrorInfo dataclass, Result[T] generic, NilPath + NilRAGState sentinel singletons.
  • src/mcp_client.py (the data_oriented refactor for MCP): (p, err) tuples → Result[Path]; assert p is not None → nil-sentinel.
  • src/ai_client.py: ProviderError exception REMOVED; _classify_<vendor>_error() returns ErrorInfo; _send_<vendor>() renamed to _send_<vendor>_result() returning Result[str].
  • src/rag_engine.py: methods return Result instead of raising.
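
To make the "errors are just cases" shape concrete, here is a minimal sketch of a Result with a side-channel error list and a nil sentinel. The names mirror the ones above, but every field, the ErrorKind members, the NIL_PATH stand-in, and the find_config() example are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Generic, TypeVar

T = TypeVar("T")


class ErrorKind(Enum):  # member names are illustrative
    NOT_FOUND = "not_found"
    PERMISSION = "permission"


@dataclass(frozen=True)
class ErrorInfo:
    kind: ErrorKind
    message: str


@dataclass(frozen=True)
class Result(Generic[T]):
    value: T  # always usable: a real value or a nil sentinel
    errors: tuple[ErrorInfo, ...] = ()  # side channel; empty means success

    @property
    def ok(self) -> bool:
        return not self.errors


NIL_PATH = Path("")  # hypothetical stand-in for the NilPath sentinel


def find_config(candidates: list[Path]) -> Result[Path]:
    """Return the first .toml candidate, or the nil path plus an error."""
    for p in candidates:
        if p.suffix == ".toml":
            return Result(p)
    return Result(NIL_PATH, (ErrorInfo(ErrorKind.NOT_FOUND, "no .toml candidate"),))
```

The caller never sees None or an exception: it can keep operating on value and inspect errors when it chooses to.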

Key decisions:

  • Internal-only refactor; the public API is preserved. _send_<vendor>() is renamed to _send_<vendor>_result() and retyped to return Result[str]. The public send() is preserved and marked @typing_extensions.deprecated; the new send_result() returns Result[str]. The actual breaking change happens in the follow-up public_api_migration_20260606 track.
  • ProviderError is FULLY REMOVED, not kept as a thin internal exception. Per Fleury, exceptions are for the SDK boundary only; once the boundary converts to ErrorInfo, no exception is needed.
  • Deprecation warning silenced in tests: tests/conftest.py adds filterwarnings("ignore::DeprecationWarning:src.ai_client") during the transition.
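
The deprecate-then-remove step could look roughly like this. The spec uses @typing_extensions.deprecated; this sketch substitutes a plain warnings.warn call so it runs on the standard library alone. The Result stand-in, the echo behaviour, and the choice to return the (possibly empty) value from the legacy wrapper are all assumptions.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:  # minimal stand-in for Result[str]
    value: str
    errors: tuple[str, ...] = ()


def send_result(prompt: str) -> Result:
    """New entry point: failures travel in the result, not as exceptions."""
    if not prompt:
        return Result("", ("empty prompt",))
    return Result(f"echo: {prompt}")  # placeholder for a real provider call


def send(prompt: str) -> str:
    """Legacy entry point, kept so existing callers keep working."""
    warnings.warn(
        "send() is deprecated; use send_result()",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    return send_result(prompt).value
```

With this shape, the follow-up migration track only has to delete send() once no caller triggers the warning.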

Coordination with pending tracks:

  • mcp_architecture_refactor_20260606 assumes the Result pattern is in place (the new sub-MCPs return Result[str, ErrorInfo] from invoke()).
  • data_structure_strengthening_20260606 assumes the Metadata family aliases are in place (the result types are referenced by name).
  • Both track specs have a §10 "Coordination with Pending Tracks" section that documents the post-tracks state and verifies it before proceeding.

Current state: Plan complete (f7b11f7f plan). Ready for execution.


4.4 data_structure_strengthening_20260606

Goal: Name the 430 anonymous dict[str, Any] / list[dict[...]] / Tuple[...] types in the codebase.

Architecture:

  • src/type_aliases.py: 10 TypeAlias definitions + 1 NamedTuple (FileItemsDiff).
    • Metadata (root), CommsLogEntry, CommsLog, HistoryMessage, History, FileItem, FileItems, ToolDefinition, ToolCall, CommsLogCallback
  • scripts/audit_weak_types.py (already committed 84fd9ac9): AST-based static analyzer. Finding dataclass; --json, --top N, --verbose modes. After this track: also --strict mode (CI gate; exits 1 if new weak sites are introduced).
  • scripts/generate_type_registry.py (Phase 2): AST-based registry generator. 3 modes — default (regenerate), --check (CI; exits 1 if drift), --diff (dry run). Writes docs/type_registry/<source_module>.md per source file.
  • docs/type_registry/: auto-generated per-source-file markdown references for the LLM to consult.

The data that drove the design:

  • 430 weak sites across 29 of 61 files in src/
  • 0 strong patterns currently (no TypeAlias, no NamedTuple, no pydantic.BaseModel in the relevant shapes)
  • 26 unique type strings after normalization
  • Top 4 unique strings = 86% of findings (list[dict[str, Any]], dict[str, Any], Dict[str, Any], List[Dict[str, Any]])
  • File distribution: ai_client.py (139), app_controller.py (86), models.py (51), api_hook_client.py (32), project_manager.py (20), aggregate.py (17) = 345 in 6 files; the rest in 23 lower-impact files

The "docs over TypedDict" decision (key user feedback mid-track):

  • Original draft proposed a follow-up track to convert aliases to TypedDicts.
  • User pushed back: pay the token cost (LLM reads the docs) instead of the upfront cost (designing TypedDict schemas for every type).
  • The docs/type_registry/ generator is the result: an LLM can cat docs/type_registry/ai_client.md to see the fields of every struct in src/ai_client.py without the code having to enforce the structure at runtime.
  • The 5-pattern structure (Nil sentinel, Zero-init, Fail-early, AND-over-OR, Side-channel errors) is documented in the styleguide.

Coordination:

  • This track's aliases compose with the Result[T] from data_oriented_error_handling_20260606: Result[FileItems], Result[CommsLogEntry], etc. are valid generics.
  • The audit script is the permanent CI gate for this convention. New dict[str, Any] in a PR fails --strict mode.

Current state: Plan complete (91475781 plan). Ready for execution.


4.5 mcp_architecture_refactor_20260606

Goal: Split the 2,205-line monolithic src/mcp_client.py (45 module-level functions) into a slim controller + 6 native sub-MCPs + 1 external sub-MCP.

Architecture:

  • src/mcp_client.py (modified, slim): SubMCP Protocol + MCPController class + module-level controller singleton + ALL_SUB_MCPS registration list + re-export shim from mcp_client_legacy.
  • src/mcp_client_legacy.py (NEW): the OLD mcp_client.py content. Re-exported for backward compat.
  • src/mcp_client_security.py (NEW): 3-layer security (Allowlist → Resolve → Validate) returning Result[Path].
  • src/mcp_file_io.py (9 tools), src/mcp_python.py (14), src/mcp_c.py (5), src/mcp_cpp.py (5), src/mcp_web.py (2), src/mcp_analysis.py (2): native sub-MCPs.
  • src/mcp_external.py: the existing ExternalMCPManager extracted; class name preserved as ExternalMCP for compat.

Naming convention (per user direction): mcp_<type>.py for native MCPs. The user explicitly said this; the convention is locked in.

Key design decisions:

  • Sub-MCP shape: class with name / description / tools (dict) / invoke() (returns Result[str, ErrorInfo]).
  • Registration mechanism: explicit controller.register(FileIOMCP()) at the bottom of mcp_client.py. New sub-MCP = create the file + add 2 lines to the registration. No magic, no auto-discovery.
  • Controller-level security: the 3-layer security runs BEFORE delegating to sub-MCPs. Sub-MCPs receive already-validated paths. Testable in isolation.
  • Dispatch inversion: the controller uses an inverted-dict self._tool_index[tool_name] -> sub_mcp for O(1) lookup. The current if/elif chain is O(n) per dispatch.
  • External MCP is NOT in ALL_SUB_MCPS — it's a sub-controller. The main controller delegates to it AFTER native sub-MCPs miss.
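
The registration and dispatch-inversion decisions above can be sketched together. This is an illustration under assumptions: the Result stand-in, the tools value type, the dispatch() method name, and the toy EchoMCP are hypothetical, and the security layer and external fallback are left out.

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass(frozen=True)
class Result:  # minimal stand-in for Result[str, ErrorInfo]
    value: str
    errors: tuple[str, ...] = ()


class SubMCP(Protocol):
    name: str
    description: str
    tools: dict[str, str]  # tool name -> description (value type assumed)

    def invoke(self, tool_name: str, args: dict[str, Any]) -> Result: ...


class EchoMCP:  # toy sub-MCP used only for this example
    name = "echo"
    description = "Example sub-MCP"
    tools = {"echo_upper": "Upper-case the given text"}

    def invoke(self, tool_name: str, args: dict[str, Any]) -> Result:
        return Result(str(args.get("text", "")).upper())


class MCPController:
    def __init__(self) -> None:
        self._tool_index: dict[str, SubMCP] = {}

    def register(self, sub_mcp: SubMCP) -> None:
        # Build the inverted index once, at registration time.
        for tool_name in sub_mcp.tools:
            self._tool_index[tool_name] = sub_mcp

    def dispatch(self, tool_name: str, args: dict[str, Any]) -> Result:
        sub_mcp = self._tool_index.get(tool_name)  # dict lookup, no if/elif chain
        if sub_mcp is None:
            return Result("", (f"unknown tool: {tool_name}",))
        return sub_mcp.invoke(tool_name, args)


controller = MCPController()
controller.register(EchoMCP())
```

Adding a sub-MCP is then one new class plus one register() call, which matches the "no magic, no auto-discovery" decision.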

The "thin adapter" approach for v1:

  • Each sub-MCP's methods (e.g., read_file, py_get_skeleton) delegate to the corresponding function in mcp_client_legacy.py. This keeps the legacy module as the source of truth for the implementation; the new mcp_<type>.py is a thin adapter that adds the class shape, the security check, and the Result wrapping.
  • A future track can move the actual implementations into the sub-MCP files directly once the architecture is established. For v1, delegation is the safer path.

Backward compatibility:

  • src/mcp_client_legacy.py re-exports all 45+ old function names.
  • src/mcp_client.py is now a slim shim that imports from legacy.
  • The 4 existing test files (test_mcp_client_beads.py, test_mcp_config.py, test_mcp_perf_tool.py, test_mcp_ts_integration.py) and src/app_controller.py:61 (the direct mcp_client.py_get_symbol_info call) continue to work unchanged.

The DSL future (per user's notes on APL/K/Cosy):

  • The user shared a friend's idea: per-MCP compact dialects (like command line but more flexible) instead of JSON.
  • Acknowledged in the spec as out of scope for this track ("no time for that").
  • Documented as mcp_dsl_20260606 follow-up in spec §12.1.
  • The sub-MCP architecture is the natural unit to pair with a DSL emitter in the future.

Current state: Plan complete (cf01870b plan). Ready for execution.


5. The Audit & Data Foundation

The most data-grounded track is data_structure_strengthening_20260606. The audit that drove it is committed at 84fd9ac9:

File: scripts/audit_weak_types.py
Size: 281 lines
Modes: default (human-readable), --json, --top N, --verbose
Detection: AST-based; regex over ast.unparse() of type annotations
Patterns detected: 14 (Dict[str, Any], list[dict[...]], Tuple[...], Optional[...], assign-tuple-literal, ...)
Positive patterns detected: TypeAlias, NamedTuple, @dataclass, pydantic.BaseModel
Exit codes: 0 = informational, 1 = usage error

Pre-track findings (baseline):

  • 430 weak sites in 29 of 61 files
  • 0 strong patterns
  • 26 unique type strings
  • Top 4 unique strings = 86% of findings

Post-track target:

  • ~60 weak sites in the 23 lower-impact files (the 6 high-traffic files contribute 0)
  • 10 TypeAlias definitions + 1 NamedTuple in use
  • --strict mode + baseline file as permanent CI gate
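
One way a baseline-backed gate could decide its exit code is sketched below. The baseline format (a JSON object of per-file weak-site counts) and the function name are assumptions; the plan only states that --strict exits 1 when new weak sites are introduced.

```python
import json


def strict_exit_code(baseline_json: str, current: dict[str, int]) -> int:
    """Return 1 if any file has more weak sites than its recorded baseline."""
    baseline: dict[str, int] = json.loads(baseline_json)
    regressions = {
        path: (baseline.get(path, 0), count)
        for path, count in current.items()
        if count > baseline.get(path, 0)  # files missing from the baseline count as 0
    }
    for path, (old, new) in sorted(regressions.items()):
        print(f"{path}: weak sites {old} -> {new}")
    return 1 if regressions else 0
```

Comparing per file rather than by total prevents a cleanup in one file from masking a regression in another.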

This is the most measurable track in the planning session. Success = a concrete number drop in the audit count.


6. The Coordinate Picture (dependencies)

The 5 tracks form a dependency graph. The arrows are "blocks":

startup_speedup_20260606  (SHIPPED)
  ↓
  ├── test_batching_refactor_20260606  (planned)
  │
  ├── qwen_llama_grok_integration_20260606  (planned)
  │      ↓
  │      ├── data_oriented_error_handling_20260606  (planned)
  │      │      ↓
  │      │      ├── public_api_migration_20260606  (follow-up; not yet specced)
  │      │      └── type_registry_ci_20260606  (follow-up; not yet specced)
  │      │
  │      └── data_structure_strengthening_20260606  (planned)
  │             ↓
  │             └── type_registry_ci_20260606  (follow-up; not yet specced)
  │
  └── mcp_architecture_refactor_20260606  (planned; depends on data_oriented + data_structure tracks)
         ↓
         └── mcp_dsl_20260606  (follow-up; not yet specced)

Critical insight: mcp_architecture_refactor_20260606 depends on BOTH data_oriented_error_handling_20260606 (for Result) and data_structure_strengthening_20260606 (for the Metadata aliases). If the implementing agent executes tracks in arbitrary order, this dependency is broken.

The recommended execution order is the topological order: startup_speedup (done) → qwen_llama_grok → data_oriented_error_handling + data_structure_strengthening (in parallel) → mcp_architecture_refactor → test_batching_refactor (no dependencies; can run anytime) → follow-up tracks.
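
The ordering can be checked mechanically with the standard library's graphlib. The sketch below encodes the "blocks" edges from the graph in this section (track names shortened, follow-ups omitted) and groups the tracks into waves that can run in parallel; the execution_waves() helper is hypothetical.

```python
from graphlib import TopologicalSorter

# Each track maps to the set of tracks that must ship before it.
BLOCKED_BY = {
    "startup_speedup": set(),
    "test_batching_refactor": {"startup_speedup"},
    "qwen_llama_grok": {"startup_speedup"},
    "data_oriented_error_handling": {"qwen_llama_grok"},
    "data_structure_strengthening": {"qwen_llama_grok"},
    "mcp_architecture_refactor": {
        "startup_speedup",
        "data_oriented_error_handling",
        "data_structure_strengthening",
    },
}


def execution_waves(blocked_by: dict[str, set[str]]) -> list[list[str]]:
    """Group tracks into waves; tracks within one wave do not block each other."""
    sorter = TopologicalSorter(blocked_by)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

With these edges, mcp_architecture_refactor lands in the last wave, after both data_oriented_error_handling and data_structure_strengthening, while test_batching_refactor becomes available as soon as startup_speedup is done.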


7. Follow-up Tracks Already Planned (Not in This Session's 5)

Each track's spec §12.1 names a follow-up. Aggregated:

| Follow-up | Parent track | Scope |
|-----------|--------------|-------|
| public_api_migration_20260606 | data_oriented_error_handling | Remove deprecated ai_client.send(); migrate all callers (multi_agent_conductor, app_controller, ~50 tests) to send_result() |
| type_registry_ci_20260606 | data_structure_strengthening | Wire generate_type_registry.py --check into CI; add pre-commit hook; document per-track commit workflow |
| mcp_dsl_20260606 | mcp_architecture_refactor | Per-MCP compact dialect for tool calls (APL/K/Cosy-inspired); ~5x token reduction per call |

All three are listed in conductor/tracks.md as [ ] placeholders. They should be sequenced AFTER the 5 main tracks ship. None are urgent; all are improvements.


8. Recommended Future Tracks

These are tracks I identified during this session but didn't fully spec. They're ranked by what I think is most important.

8.1 Post-Tracks Documentation Synchronization (top pick)

Why: The 5 planned tracks add 10+ new modules and change the architecture significantly. The existing docs (docs/guide_*.md) were last updated in the 2026-06-02 comprehensive docs refresh — and are about to be more out of date than they are now. Stale docs are the #1 enemy of AI readability (an LLM reading guide_ai_client.md and finding it pre-dates Result/ErrorInfo will hallucinate the wrong shape).

Scope (1-2 phases):

  • Phase 1: Update all existing guides (guide_ai_client.md, guide_mcp_client.md, etc.) to reflect the post-tracks state.
  • Phase 2: Add cookbooks ("How to add a new sub-MCP", "How to add a new AI vendor", "How to add a new result type") + a docs/type_registry.md index.

Why first: Bounded and achievable. Closes the loop on all the planning work — each track ships a module; this track ships the docs that explain those modules.

8.2 Test Coverage Audit & Improvement (runner-up)

Why: The project has a stated >80% coverage target per conductor/workflow.md, but the actual current state is unknown. Under-tested areas are likely app_controller.py (4,153 lines; the orchestrator that touches everything) and multi_agent_conductor.py (the most complex control flow). The new modules from the 5 planned tracks each get unit tests in their respective tracks, but integration tests are sparse.

Scope (1-2 phases):

  • Phase 1: Run pytest --cov=src --cov-report=html; identify the bottom-10 modules by coverage; write tests to bring each to >80%.
  • Phase 2: Add a coverage threshold to CI (e.g., --cov-fail-under=80); add per-module coverage badges to docs/Readme.md.

8.3 Security Audit / Hardening

Why: The 3-layer MCP security model is solid, but there are adjacent concerns:

  • Command injection in run_powershell — the AI generates PowerShell commands; how is the risk of a malicious model call mitigated? The HITL dialog exists, but is it consistently applied?
  • Prompt injection — the AI sees file content, web search results, Beads queries. A malicious file could inject instructions that the AI then follows. How is this sanitized?
  • Sensitive data in logs — the comms_log records full API requests/responses. If a user includes an API key or password in a message, it ends up in the log. What's the redaction policy?

Scope (1-2 phases):

  • Phase 1: Threat model the AI tool-calling surface; document the existing mitigations; identify gaps.
  • Phase 2: Add log redaction for known secret patterns; add a "dangerous command" detector for run_powershell; add an "untrusted content" marker for content from external sources.

8.4 Dependency Hygiene

Why: pyproject.toml has a long dependency list, and no existing track covers:

  • Version pinning strategy (caret vs tilde vs exact)
  • Deprecation monitoring (track when a vendor SDK announces EOL)
  • License audit (any GPL contamination?)
  • CVE scanning

This is a "track for the person who maintains the project 6 months from now."


9. Risks & Open Questions (Cross-Track)

9.1 Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| The implementing agent executes tracks in the wrong order, breaking the dependency chain (especially for mcp_architecture_refactor_20260606 which depends on the other two). | Medium | High (broken tests; confusing failures) | The recommended execution order in §6 is explicit. The plan files note the dependencies in their "blocked_by" sections. |
| The 5 tracks add 10+ new files but the scripts/audit_main_thread_imports.py doesn't catch a heavy import in one of the new modules. | Low | Medium (regresses the startup_speedup invariant) | Each new module's Phase 1 task includes an import-time check (uv run python -c "import time; ..."). |
| A future contributor adds a new dict[str, Any] after the data_structure_strengthening track; the audit --strict mode catches it, but they're confused about why. | Medium | Low (process friction) | The styleguide + the deprecation warning in --strict mode explain the rule. |
| The mcp_client_legacy.py shim becomes permanent and never gets removed. | Medium | Low (acceptable) | The public_api_migration_20260606 follow-up (and any future MCP-API changes) is the natural place to remove the shim. |
| The DSL idea becomes a "we have to do it now" before the architecture track is done. | Low | Low | The DSL is explicitly out of scope. The sub-MCP architecture is compatible with a future DSL layer. |

9.2 Open questions for the next planning round

  • Where do the implementation agents' session notes / handoffs go? Each track has metadata.json + state.toml for the planning side. There's no equivalent for the implementation side. (The startup_speedup_20260606 track's recent commits 253e1798, 88fc42bb, 8c4791d0 suggest they do handoff via commit messages, but a structured format would be nice.)
  • What happens when a track's implementation diverges from the plan? Per conductor/workflow.md, "implementation differs from spec" is handled by updating the spec. But the plan files don't have a clear "deviations" section. Consider adding one to future plans.
  • How are plan review comments captured? The plan files are committed at cf01870b (and the others). But there's no conductor/plan_reviews/ directory. If the implementing agent has questions or disagreements, where do they go?

10. File Index

For the implementing agent (and any future planner), here's the canonical file index.

10.1 Conductor convention files (the project-level structure)

| File | Purpose |
|------|---------|
| conductor/tracks.md | Master track registry. Lists all tracks with their status ([ ] planned, [~] in progress, [x] done) and [track-created: <sha>] references. |
| conductor/workflow.md | The project's TDD + per-track commit + git note workflow. |
| conductor/product-guidelines.md | The project's design principles (1-space indent, 1 commit per task, type hints, etc.). |
| conductor/product.md | The project's product vision and use cases. |
| conductor/tech-stack.md | The project's tech stack. |
| conductor/code_styleguides/python.md | Language-specific style guide. |
| conductor/code_styleguides/error_handling.md (created in data_oriented_error_handling) | Data-Oriented Error Handling convention. |
| conductor/code_styleguides/type_aliases.md (created in data_structure_strengthening) | Type Aliases convention. |

10.2 The 5 new tracks (this session's planning output)

| Track | Spec SHA | Plan SHA | Files |
|-------|----------|----------|-------|
| test_batching_refactor_20260606 | b7a97374 | f7b11f7f | spec.md, metadata.json, state.toml, plan.md |
| qwen_llama_grok_integration_20260606 | 7c1d597e (track init), 97daaff2 (consistency) | b17cbbde | spec.md, metadata.json, state.toml, plan.md |
| data_oriented_error_handling_20260606 | 494f68f9 (init), cbc3b075 (track + tracks.md), f7b11f7f (plan) | f7b11f7f | spec.md, metadata.json, state.toml, plan.md |
| data_structure_strengthening_20260606 | ed42a97a (init), aba35f9f (registry), 432c7895 (risk) | 91475781 | spec.md, metadata.json, state.toml, plan.md |
| mcp_architecture_refactor_20260606 | 2720a894 (init), dd137df7 (backfill) | cf01870b | spec.md, metadata.json, state.toml, plan.md |

10.3 The 5 new module families (what the tracks will create)

| Module family | Created by | Files |
|---------------|------------|-------|
| Test batching | test_batching_refactor_20260606 | scripts/{test_categorizer,test_batcher,pytest_collection_order}.py, scripts/run_tests_batched.py, tests/test_categories.toml |
| Vendor capability matrix | qwen_llama_grok_integration_20260606 | src/{vendor_capabilities,openai_compatible,qwen_adapter}.py |
| Result types | data_oriented_error_handling_20260606 | src/result_types.py |
| Type aliases + registry | data_structure_strengthening_20260606 | src/type_aliases.py, scripts/generate_type_registry.py, docs/type_registry/ |
| Sub-MCPs | mcp_architecture_refactor_20260606 | src/mcp_<type>.py (7 files), src/mcp_client_security.py, src/mcp_client_legacy.py |

10.4 The audit script (data-driven decisions)

| File | Purpose |
|------|---------|
| scripts/audit_weak_types.py (committed 84fd9ac9) | AST analyzer that found the 430 weak sites driving data_structure_strengthening. |

10.5 The startup_speedup predecessor

| Track | Status | Key outputs |
|-------|--------|-------------|
| startup_speedup_20260606 | SHIPPED (commits 12cec6ae, bb2ac6c9, 253e1798, 88fc42bb, 8c4791d0) | _io_pool ThreadPoolExecutor; warmup mechanism; lazy SDK imports; scripts/audit_main_thread_imports.py CI gate |

This is the predecessor for all 5 tracks — the lazy-SDK-import convention means the new modules can use from src.openai_compatible import send_openai_compatible at the top without paying the SDK import cost on the main thread.


11. Closing Notes

11.1 What the user achieved in this session

In a single multi-hour planning session, the user:

  • Approved 5 architectural refactor tracks end-to-end (brainstorming → spec → plan)
  • Made 3 major design decisions with significant impact: (1) the mcp_<type>.py naming convention, (2) the "docs over TypedDict" tradeoff, (3) the deprecation-not-removal of the public send() API
  • Brought in external inspiration: Ryan Fleury's data-oriented error handling, the user's friend's DSL idea
  • Established a pattern for data-grounded planning: every spec is preceded by an audit (or an inventory) that drives the design decisions

11.2 What the implementing agent inherits

  • 5 fully-specced + planned tracks, each with TDD task breakdown
  • A clear execution order (topological sort of the dependency graph)
  • ~25+ unit tests per track (pre-existing + new) that serve as regression coverage
  • A permanent audit + CI gate (scripts/audit_weak_types.py --strict) for the type-alias convention
  • Styleguides + product-guidelines + a new docs directory (docs/type_registry/) that serve as living documentation

11.3 What I would do differently if I could start over

  • Earlier on the data-oriented framing: The user brought Fleury's article mid-session (for the error-handling track). It would have been useful to surface the data-oriented design philosophy in the FIRST track (test_batching_refactor) and apply it there. Going forward, this is a thread to weave into every track.
  • The "richest context" claim is half-true: I have deep visibility into architecture and code quality concerns but little visibility into operational / production concerns (observability, telemetry, error rates in the field, user experience metrics). The recommended future tracks in §8 reflect this bias.

11.4 One last recommendation

The post-tracks documentation track (§8.1) is the single most important thing to do NEXT — after the 5 tracks ship, the docs are out of date. Plan it BEFORE the user starts working on the next big feature, so the codebase stays maintainable.