Single-session planning digest that captures:
- The 5 tracks fully specced + planned (test_batching, qwen_llama_grok, data_oriented_error_handling, data_structure_strengthening, mcp_architecture_refactor)
- Cross-cutting design themes (data-oriented, audit-driven, per-track commit + git note, out-of-scope-by-default)
- The audit + data foundation (scripts/audit_weak_types.py; 430 -> 60 findings; 0 strong patterns; 26 unique type strings; 86% concentrated in 6 files)
- The dependency graph + recommended execution order
- Follow-up tracks already planned in spec §12.1 of each track
- Recommended future tracks (post-tracks documentation is the top pick)
- Risks, open questions, and a complete file index

This is the kind of reference document that:
- Future planners consult to understand the codebase's current state
- The implementing agent uses to coordinate across tracks
- The user reviews as a digest of the planning work

Written in the project's docs/reports/ directory alongside the existing Phase 5 reports (PHASE5_STABILISATION_REPORT.md, MUTATION_MATRIX_PHASE5.md, etc.).
Planning Digest: 5-Track Architectural Refactor (2026-06-06)
Status: Planning complete; implementation in flight

Author: Tier 2 Tech Lead (brainstorming + spec + plan for all 5 tracks)

Date: 2026-06-06

Audience: Future planners, the implementing agent, the user (as a reference / digest)
1. Executive Summary
In a single planning session, 5 architectural refactor tracks were specced and planned end-to-end. Together they reshape the manual_slop codebase around three foundational design principles — data-oriented error handling (Fleury), data-oriented types (named, documented, generated), and modular MCP architecture (sub-MCPs by category). All 5 tracks share a common ancestor in the startup_speedup_20260606 track (already shipped as of 12cec6ae), which established the lazy-SDK-import convention the other tracks depend on.
| # | Track | Status | Phases | Key new files | What it does |
|---|---|---|---|---|---|
| 1 | `test_batching_refactor_20260606` | Planned | 4 | `scripts/{test_categorizer,test_batcher,pytest_collection_order}.py` | Replaces alphabetical 4-at-a-time batching with tiered batching (Tier 1 unit + xdist, Tier 3 live_gui in one session, etc.) |
| 2 | `qwen_llama_grok_integration_20260606` | Planned | 6 | `src/{vendor_capabilities,openai_compatible,qwen_adapter}.py` | Adds Qwen (DashScope), Llama (Ollama + OpenRouter + custom URL), Grok (xAI). Introduces the Vendor Capability Matrix. |
| 3 | `data_oriented_error_handling_20260606` | Planned | 5 | `src/result_types.py` | Introduces `Result[T]`, `ErrorInfo`, `NilPath` per Fleury. Removes `ProviderError` exception. Marks `send()` `@deprecated`; adds `send_result()`. |
| 4 | `data_structure_strengthening_20260606` | Planned | 2 | `src/type_aliases.py`, `scripts/generate_type_registry.py` | Introduces 10 `TypeAlias` for the 430 anonymous `dict[str, Any]` / `list[dict[...]]` sites. Adds auto-generated `docs/type_registry/`. |
| 5 | `mcp_architecture_refactor_20260606` | Planned | 7 | `src/mcp_<type>.py` (7 files), `src/mcp_client_security.py` | Splits 2,205-line `mcp_client.py` into slim controller + 6 native sub-MCPs + 1 external sub-MCP. |
Combined impact: ~5 new framework files; ~6 modified framework files; ~6 modified high-traffic files (for the type-aliases refactor); 1 monolithic file split into 9 focused files; 1 new CI gate script; 1 new docs directory.
2. Session Context
2.1 Workflow model
The user is operating in a planning / execution split mode:
- This session: Tier 2 Tech Lead (me) does brainstorming → spec → plan for each track. No code is written or executed.
- External session: Another agent does the implementation. It picks up each `plan.md` and executes task-by-task via the project's MMA tier system.
This split lets the user think strategically (planning) while the heavy lifting (executing) happens in parallel.
2.2 The pre-existing baseline
Before this session, the project had:
- 277 test files in `tests/` (`test_*.py` + `*_sim.py`)
- 53 src files (`src/*.py`)
- 14 deep-dive guides (`docs/guide_*.md`)
- The startup_speedup_20260606 track was in flight (Phase 6 complete per `253e1798`; track SHIPPED per `12cec6ae` in the same window as this planning session)
- The test_batching_refactor_20260606 track had been planned (spec + plan were in the folder but execution hadn't started)
- Conductor convention was in place: every track has `spec.md` + `metadata.json` + `state.toml`; the `tracks.md` registry lists all tracks with their `[track-created: <sha>]` references
2.3 What changed during this session
The user asked for 5 different refactor specs in sequence:
- Test batching refactor: already-planned track; I reviewed and committed
- Qwen/Llama/Grok vendors + capability matrix: new spec; multiple design questions resolved
- Data-oriented error handling (Fleury pattern): new spec; user brought the article + friend's notes
- Data structure strengthening (type aliases + named tuples): new spec; user proposed auto-generated docs over TypedDict migration
- MCP architecture refactor (sub-MCPs): new spec; user proposed `mcp_<type>.py` naming + the DSL future idea
For each, I followed the brainstorming → spec → plan flow per the user's stated preference.
3. Cross-Cutting Design Themes
Five design themes run through all the tracks. Understanding them makes each track's individual decisions coherent.
3.1 Data-Oriented Design (Fleury / Acton / Lottes)
The user explicitly references this in two of the five tracks (data_oriented_error_handling_20260606 for errors; mcp_architecture_refactor_20260606 for module boundaries). The framing is:
- Errors are just cases, not special control-flow primitives. Use `Result[T]` with side-channel error lists, not exceptions.
- Algorithms on data, not methods on objects. The `MCPController` is a data structure; sub-MCPs are data; the dispatch is a function from data to data.
- Stable names, not types. Type aliases (`Metadata`, `FileItem`, etc.) name data roles; they don't enforce structure (that's deferred to TypedDict if ever).
- Shared code where possible; unique code only where vendor-specific. The `_send_<vendor>_result()` functions in `ai_client.py` are thin boundary adapters; the `send_openai_compatible()` helper is the shared algorithm.
3.2 Capability / Pattern / Convention as first-class docs
The user values explicit, discoverable conventions over implicit understanding. Each track introduces at least one canonical document:
- `conductor/code_styleguides/error_handling.md` (Fleury patterns)
- `conductor/code_styleguides/type_aliases.md` (type alias conventions)
- `docs/type_registry/` (auto-generated per-source-file schema docs)
- `conductor/code_styleguides/mcp_<type>.py` (implicit, via the naming convention)
The product-guidelines.md is the umbrella; the styleguides are the detailed references. This pattern should be followed for any future track that introduces a new convention.
3.3 Audit + data-driven decisions
Two of the five tracks are data-grounded:
- `test_batching_refactor_20260606`: addressed the actual problem (alphabetical 4-at-a-time batching) and explicitly designed the solution around the test categories the project already uses (Tier 1 unit, Tier 2 mock_app, Tier 3 live_gui, etc.).
- `data_structure_strengthening_20260606`: driven by the `scripts/audit_weak_types.py` findings (430 weak sites; 86% concentrated in 6 high-traffic files; 0 strong patterns; 26 unique type strings; top 4 = 86% of findings).
The audit data is the source of truth. The track's success criterion is a measurable drop in the audit count (430 → ~60 = 86% reduction).
3.4 Process: per-track commit + git note + checkpoint
Every plan follows the same template:
- Per-task commit: 1 commit per Red-Green-Refactor step
- Per-checkpoint git note: `git notes add -m "..."` summarizing what the phase delivered
- Per-checkpoint state.toml update: `current_phase` advanced; `checkpointsha` filled in
This is a feature of the project's conductor/workflow.md and is consistently applied. The next planner / implementer should follow the same template.
3.5 Out-of-scope-by-default; follow-up tracks for the next round
Each of the 5 tracks explicitly defers work to follow-up tracks. The follow-ups are documented in each spec's §12.1:
- `public_api_migration_20260606`: removes deprecated `send()` (from data_oriented_error_handling)
- `type_registry_ci_20260606`: wires `generate_type_registry.py --check` into CI (from data_structure_strengthening)
- `mcp_dsl_20260606`: per-MCP compact DSL for tool calls (from mcp_architecture_refactor)
- `typed_dict_migration_20260606`: convert most-used aliases to `TypedDict` (initially planned; later replaced by the docs approach; kept as a future option)
These follow-ups are listed in conductor/tracks.md as [ ] placeholders (item 0f etc.). They should be sequenced AFTER the 5 main tracks ship.
4. The 5 Tracks in Detail
4.1 test_batching_refactor_20260606
Goal: Replace alphabetical 4-at-a-time batching with tiered batching that respects fixture-class boundaries.
Architecture:
- `scripts/test_categorizer.py`: AST-based classifier that determines each test file's `FixtureClass` (UNIT, MOCK_APP, LIVE_GUI, HEADLESS, OPT_IN, PERFORMANCE) and its `batch_group` (e.g., `core`, `gui`, `mma`).
- `scripts/test_batcher.py`: Pure scheduler. `plan(records, options) -> list[Batch]` deterministically produces batches.
- `scripts/pytest_collection_order.py`: Conftest-loaded plugin for the per-test order control (opt-in per file).
- `scripts/run_tests_batched.py`: Modified CLI orchestrator with `--tiers`, `--include-opt-in`, `--plan`, `--audit` modes.
Key decisions:
- Tier 3 (live_gui) is one pytest invocation, not many. This is THE single biggest runtime savings (15s startup amortized).
- Tier 1 (unit) uses pytest-xdist for parallelism.
- Tier 0 (opt-in) is gated on BOTH env var AND CLI flag (defense-in-depth: setting the env var alone shouldn't accidentally enable docker tests).
- Hybrid classification: auto-infer from filename + AST fixture scan; hand-curated `tests/test_categories.toml` overrides for cross-cutting and ambiguous files.
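A rough sketch of how these decisions could combine in a pure scheduler. The signature is simplified relative to `plan(records, options)`, the tier strings are placeholders, and the environment variable name `RUN_OPT_IN_TESTS` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Batch:
    tier: str
    files: tuple[str, ...]
    use_xdist: bool = False


def opt_in_enabled(cli_flag: bool, env: dict[str, str]) -> bool:
    """Opt-in tests run only when BOTH the CLI flag and the env var are set."""
    # RUN_OPT_IN_TESTS is a hypothetical variable name.
    return cli_flag and env.get("RUN_OPT_IN_TESTS") == "1"


def plan(records: dict[str, str], include_opt_in: bool, env: dict[str, str]) -> list[Batch]:
    """Pure function: map {file: tier} to a deterministic list of batches."""
    by_tier: dict[str, list[str]] = {}
    for path, tier in records.items():
        by_tier.setdefault(tier, []).append(path)
    batches: list[Batch] = []
    if "unit" in by_tier:
        # Unit tests are independent, so they can be spread across xdist workers.
        batches.append(Batch("unit", tuple(sorted(by_tier["unit"])), use_xdist=True))
    if "live_gui" in by_tier:
        # One pytest invocation for the whole tier, so startup is paid once.
        batches.append(Batch("live_gui", tuple(sorted(by_tier["live_gui"]))))
    if "opt_in" in by_tier and opt_in_enabled(include_opt_in, env):
        batches.append(Batch("opt_in", tuple(sorted(by_tier["opt_in"]))))
    return batches
```

Passing the environment in as a plain dict (for example `dict(os.environ)` at the call site) keeps the scheduler free of hidden inputs, so the same records always produce the same batches.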
What's NOT done: The script does NOT modify test files or fixtures; it only categorizes and batches. New tests get sensible defaults automatically.
Current state: Plan complete (7fdab705 spec, f7b11f7f plan). Ready for execution.
4.2 qwen_llama_grok_integration_20260606
Goal: Add first-class support for Qwen, Llama, Grok. Introduce the Vendor Capability Matrix.
Architecture:
- `src/vendor_capabilities.py`: `VendorCapabilities` dataclass, `_REGISTRY` populated per-(vendor, model).
- `src/openai_compatible.py`: shared `send_openai_compatible()` helper (data-oriented design: operates on normalized data).
- `src/qwen_adapter.py`: DashScope-specific tool format translation + error classification.
Key decisions:
- Naming convention: `_send_<vendor>_result()` returning `Result[str, ErrorInfo]` (8 vendors: Gemini, Anthropic, DeepSeek, MiniMax, Gemini CLI, Qwen, Llama, Grok).
- Capability Matrix v1: 7 capabilities (vision, tool_calling, caching, streaming, model_discovery, context_window, cost_tracking). Audio and server-side code_execution deferred to a future track.
- UX adaptation: 9 UI elements read the matrix (including the screenshot button, tools toggle, cache panel, stream progress, fetch models button, token budget max, cost panel).
- Code at the OpenAI-compatible SDK boundary keeps raising; the new `_send_<vendor>_result()` functions catch and convert to `ErrorInfo`. Per Fleury: "exceptions are reserved for the SDK boundary."
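A minimal sketch of what a per-(vendor, model) capability record and a UI lookup might look like. The field names follow the seven v1 capabilities; the registry entry, the `get_capabilities` helper and the `screenshot_button_enabled` function are illustrative assumptions, not the planned API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VendorCapabilities:
    vision: bool = False
    tool_calling: bool = False
    caching: bool = False
    streaming: bool = False
    model_discovery: bool = False
    context_window: int = 0
    cost_tracking: bool = False


# Keyed per (vendor, model). The entry is a placeholder, not a real model spec.
_REGISTRY: dict[tuple[str, str], VendorCapabilities] = {
    ("example_vendor", "example-model"): VendorCapabilities(
        vision=True, tool_calling=True, streaming=True, context_window=32_000
    ),
}


def get_capabilities(vendor: str, model: str) -> VendorCapabilities:
    """Unknown pairs fall back to an all-off record, so callers never branch on None."""
    return _REGISTRY.get((vendor, model), VendorCapabilities())


def screenshot_button_enabled(vendor: str, model: str) -> bool:
    """Example of a UI element reading the matrix instead of hard-coding vendor names."""
    return get_capabilities(vendor, model).vision
```

Returning a zero-valued record for unknown models mirrors the zero-init idea from the error-handling track: the UI degrades to "feature off" rather than raising.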
Coordination with startup_speedup_20260606: Qwen's DashScope SDK adds a new import; the audit script scripts/audit_main_thread_imports.py ensures the import is gated to a worker thread, not the main thread. Verified at the baseline in Phase 1 of the track.
Current state: Plan complete (b17cbbde plan). Ready for execution.
4.3 data_oriented_error_handling_20260606
Goal: Introduce Ryan Fleury's "errors are just cases" framework as a project convention.
Architecture:
- `src/result_types.py`: `ErrorKind` enum, `ErrorInfo` dataclass, `Result[T]` generic, `NilPath` + `NilRAGState` sentinel singletons.
- `src/mcp_client.py` (the data_oriented refactor for MCP): `(p, err)` tuples → `Result[Path]`; `assert p is not None` → nil-sentinel.
- `src/ai_client.py`: `ProviderError` exception REMOVED; `_classify_<vendor>_error()` returns `ErrorInfo`; `_send_<vendor>()` renamed to `_send_<vendor>_result()` returning `Result[str]`.
- `src/rag_engine.py`: methods return `Result` instead of raising.
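To make the shape concrete, here is a minimal sketch of a `Result[T]` with a side-channel error list and a boundary adapter that converts exceptions into `ErrorInfo`. The `ErrorKind` members, the field names, and the `classify_error` / `call_boundary` helpers are assumptions for illustration; the planned `src/result_types.py` may differ, and the nil sentinels are omitted.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class ErrorKind(Enum):
    NETWORK = "network"
    AUTH = "auth"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class ErrorInfo:
    kind: ErrorKind
    message: str


@dataclass(frozen=True)
class Result(Generic[T]):
    value: T
    errors: list[ErrorInfo] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.errors


def classify_error(exc: Exception) -> ErrorInfo:
    """Map an exception raised at the SDK boundary to plain data."""
    if isinstance(exc, ConnectionError):
        return ErrorInfo(ErrorKind.NETWORK, str(exc))
    if isinstance(exc, PermissionError):
        return ErrorInfo(ErrorKind.AUTH, str(exc))
    return ErrorInfo(ErrorKind.UNKNOWN, str(exc))


def call_boundary(call: Callable[[], str]) -> Result[str]:
    """The wrapped call may raise; callers of this function only ever see a Result."""
    try:
        return Result(call())
    except Exception as exc:
        # Zero value plus a side-channel error, instead of propagating the exception.
        return Result("", [classify_error(exc)])
```

A caller can then write `r = call_boundary(fn)` and branch on `r.ok`, treating failure as one more case of ordinary data.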
Key decisions:
- Internal-only refactor for the public API. `_send_<vendor>_result()` is renamed + retuned. The public `send()` is preserved, marked `@typing_extensions.deprecated`; the new `send_result()` returns `Result[str]`. The actual breaking change happens in the follow-up `public_api_migration_20260606` track.
- `ProviderError` is FULLY REMOVED, not kept as a thin internal exception. Per Fleury, exceptions are for the SDK boundary only; once the boundary converts to `ErrorInfo`, no exception is needed.
- Deprecation warning emitted in tests: `tests/conftest.py` adds `filterwarnings("ignore::DeprecationWarning:src.ai_client")` during the transition.
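The deprecate-then-remove approach can be illustrated with the standard library alone. This sketch uses `warnings.warn` as a stand-in for the `@typing_extensions.deprecated` decorator named above (swapped in to keep the example dependency-free), and both function bodies are placeholders rather than the real `ai_client` logic.

```python
import warnings


def send_result(prompt: str) -> tuple[str, list[str]]:
    """Stand-in for the new API: returns (text, errors) instead of raising."""
    if not prompt:
        return "", ["empty prompt"]
    return f"echo: {prompt}", []


def send(prompt: str) -> str:
    """Legacy entry point kept for callers that have not migrated yet."""
    warnings.warn(
        "send() is deprecated; use send_result()",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this shim
    )
    text, _errors = send_result(prompt)
    return text
```

Existing callers keep working and see a `DeprecationWarning`, while new code can adopt `send_result()` immediately; the breaking removal is left to the follow-up track.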
Coordination with pending tracks:
- `mcp_architecture_refactor_20260606` assumes the `Result` pattern is in place (the new sub-MCPs return `Result[str, ErrorInfo]` from `invoke()`).
- `data_structure_strengthening_20260606` assumes the `Metadata` family aliases are in place (the result types are referenced by name).
- Both track specs have a §10 "Coordination with Pending Tracks" section that documents the post-tracks state and verifies it before proceeding.
Current state: Plan complete (f7b11f7f plan). Ready for execution.
4.4 data_structure_strengthening_20260606
Goal: Name the 430 anonymous dict[str, Any] / list[dict[...]] / Tuple[...] types in the codebase.
Architecture:
- `src/type_aliases.py`: 10 `TypeAlias` definitions + 1 `NamedTuple` (`FileItemsDiff`).
  - `Metadata` (root), `CommsLogEntry`, `CommsLog`, `HistoryMessage`, `History`, `FileItem`, `FileItems`, `ToolDefinition`, `ToolCall`, `CommsLogCallback`
- `scripts/audit_weak_types.py` (already committed `84fd9ac9`): AST-based static analyzer. `Finding` dataclass; `--json`, `--top N`, `--verbose` modes. After this track: also `--strict` mode (CI gate; exits 1 if new weak sites are introduced).
- `scripts/generate_type_registry.py` (Phase 2): AST-based registry generator. 3 modes: default (regenerate), `--check` (CI; exits 1 if drift), `--diff` (dry run). Writes `docs/type_registry/<source_module>.md` per source file.
- `docs/type_registry/`: auto-generated per-source-file markdown references for the LLM to consult.
The data that drove the design:
- 430 weak sites across 29 of 61 files in `src/`
- 0 strong patterns currently (no `TypeAlias`, no `NamedTuple`, no `pydantic.BaseModel` in the relevant shapes)
- 26 unique type strings after normalization
- Top 4 unique strings = 86% of findings (`list[dict[str, Any]]`, `dict[str, Any]`, `Dict[str, Any]`, `List[Dict[str, Any]]`)
- File distribution: ai_client.py (139), app_controller.py (86), models.py (51), api_hook_client.py (32), project_manager.py (20), aggregate.py (17) = 345 in 6 files; the rest in 23 lower-impact files
The "docs over TypedDict" decision (key user feedback mid-track):
- Original draft proposed a follow-up track to convert aliases to `TypedDict`s.
- User pushed back: pay the token cost (LLM reads the docs) instead of the upfront cost (designing `TypedDict` schemas for every type).
- The `docs/type_registry/` generator is the result: an LLM can `cat docs/type_registry/ai_client.md` to see the fields of every struct in `src/ai_client.py` without the code having to enforce the structure at runtime.
- The 5-pattern structure (Nil sentinel, Zero-init, Fail-early, AND-over-OR, Side-channel errors) is documented in the styleguide.
Coordination:
- This track's aliases compose with the `Result[T]` from `data_oriented_error_handling_20260606`: `Result[FileItems]`, `Result[CommsLogEntry]`, etc. are valid generics.
- The audit script is the permanent CI gate for this convention. New `dict[str, Any]` in a PR fails `--strict` mode.
Current state: Plan complete (91475781 plan). Ready for execution.
4.5 mcp_architecture_refactor_20260606
Goal: Split the 2,205-line monolithic src/mcp_client.py (45 module-level functions) into a slim controller + 6 native sub-MCPs + 1 external sub-MCP.
Architecture:
- `src/mcp_client.py` (modified, slim): `SubMCP` Protocol + `MCPController` class + module-level `controller` singleton + `ALL_SUB_MCPS` registration list + re-export shim from `mcp_client_legacy`.
- `src/mcp_client_legacy.py` (NEW): the OLD `mcp_client.py` content. Re-exported for backward compat.
- `src/mcp_client_security.py` (NEW): 3-layer security (Allowlist → Resolve → Validate) returning `Result[Path]`.
- `src/mcp_file_io.py` (9 tools), `src/mcp_python.py` (14), `src/mcp_c.py` (5), `src/mcp_cpp.py` (5), `src/mcp_web.py` (2), `src/mcp_analysis.py` (2): native sub-MCPs.
- `src/mcp_external.py`: the existing `ExternalMCPManager` extracted; class name preserved as `ExternalMCP` for compat.
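One way to read the Allowlist → Resolve → Validate chain is sketched below. What each layer checks is an assumption made for illustration (the digest names the layers but not their rules), and `PathResult`, `NIL_PATH` and `check_path` are hypothetical stand-ins for `Result[Path]`, `NilPath` and the real security entry point.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class PathResult:
    path: Path
    error: Optional[str] = None


NIL_PATH = Path()  # sentinel returned alongside an error, so callers never get None


def check_path(raw: str, allowed_roots: list[Path]) -> PathResult:
    # Layer 1 (allowlist): there must be at least one configured root.
    if not allowed_roots:
        return PathResult(NIL_PATH, "no allowed roots configured")
    # Layer 2 (resolve): collapse '..' segments and symlinks before comparing.
    resolved = Path(raw).resolve()
    # Layer 3 (validate): the resolved path must sit under an allowed root.
    for root in allowed_roots:
        if resolved.is_relative_to(root.resolve()):
            return PathResult(resolved)
    return PathResult(NIL_PATH, f"{raw} escapes the allowed roots")
```

Resolving before validating matters: a raw string such as `root/../secret` starts with an allowed prefix but resolves outside it, so a prefix check on the unresolved string would pass incorrectly.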
Naming convention (per user direction): mcp_<type>.py for native MCPs. The user explicitly said this; the convention is locked in.
Key design decisions:
- Sub-MCP shape: class with `name` / `description` / `tools` (dict) / `invoke()` (returns `Result[str, ErrorInfo]`).
- Registration mechanism: explicit `controller.register(FileIOMCP())` at the bottom of `mcp_client.py`. New sub-MCP = create the file + add 2 lines to the registration. No magic, no auto-discovery.
- Controller-level security: the 3-layer security runs BEFORE delegating to sub-MCPs. Sub-MCPs receive already-validated paths. Testable in isolation.
- Dispatch inversion: the controller uses an inverted-dict `self._tool_index[tool_name] -> sub_mcp` for O(1) lookup. The current if/elif chain is O(n) per dispatch.
- External MCP is NOT in `ALL_SUB_MCPS`; it's a sub-controller. The main controller delegates to it AFTER native sub-MCPs miss.
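The registration and dispatch-inversion decisions can be sketched as follows. `ToolResult` is a simplified stand-in for `Result[str, ErrorInfo]`, `EchoMCP` is a hypothetical sub-MCP that exists only to exercise the controller, and the security layer and external-MCP fallback are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol


@dataclass(frozen=True)
class ToolResult:
    value: str
    error: Optional[str] = None


class SubMCP(Protocol):
    name: str
    tools: dict[str, Callable[..., str]]

    def invoke(self, tool_name: str, **kwargs: object) -> ToolResult: ...


class EchoMCP:
    """Hypothetical sub-MCP with a single tool."""

    name = "echo"

    def __init__(self) -> None:
        self.tools = {"echo_upper": lambda text: text.upper()}

    def invoke(self, tool_name: str, **kwargs: object) -> ToolResult:
        return ToolResult(self.tools[tool_name](**kwargs))


class MCPController:
    def __init__(self) -> None:
        self._tool_index: dict[str, SubMCP] = {}

    def register(self, sub_mcp: SubMCP) -> None:
        # Invert sub_mcp.tools once so each dispatch is a single dict lookup.
        for tool_name in sub_mcp.tools:
            self._tool_index[tool_name] = sub_mcp

    def dispatch(self, tool_name: str, **kwargs: object) -> ToolResult:
        sub_mcp = self._tool_index.get(tool_name)
        if sub_mcp is None:
            return ToolResult("", f"unknown tool: {tool_name}")
        return sub_mcp.invoke(tool_name, **kwargs)


controller = MCPController()
controller.register(EchoMCP())  # explicit registration, no auto-discovery
```

With this shape, adding a sub-MCP means writing one class and one `register` call, and `controller.dispatch("echo_upper", text="hi")` never walks an if/elif chain.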
The "thin adapter" approach for v1:
- Each sub-MCP's methods (e.g., `read_file`, `py_get_skeleton`) delegate to the corresponding function in `mcp_client_legacy.py`. This keeps the legacy module as the source of truth for the implementation; the new `mcp_<type>.py` is a thin adapter that adds the class shape, the security check, and the `Result` wrapping.
- A future track can move the actual implementations into the sub-MCP files directly once the architecture is established. For v1, delegation is the safer path.
Backward compatibility:
- `src/mcp_client_legacy.py` re-exports all 45+ old function names.
- `src/mcp_client.py` is now a slim shim that imports from legacy.
- The 4 existing test files (`test_mcp_client_beads.py`, `test_mcp_config.py`, `test_mcp_perf_tool.py`, `test_mcp_ts_integration.py`) and `src/app_controller.py:61` (the direct `mcp_client.py_get_symbol_info` call) continue to work unchanged.
The DSL future (per user's notes on APL/K/Cosy):
- The user shared a friend's idea: per-MCP compact dialects (like command line but more flexible) instead of JSON.
- Acknowledged in the spec as out of scope for this track ("no time for that").
- Documented as `mcp_dsl_20260606` follow-up in spec §12.1.
- The sub-MCP architecture is the natural unit to pair with a DSL emitter in the future.
Current state: Plan complete (cf01870b plan). Ready for execution.
5. The Audit & Data Foundation
The most data-grounded track is data_structure_strengthening_20260606. The audit that drove it is committed at 84fd9ac9:
File: scripts/audit_weak_types.py
Size: 281 lines
Modes: default (human-readable), --json, --top N, --verbose
Detection: AST-based; regex over ast.unparse() of type annotations
Patterns detected: 14 (Dict[str, Any], list[dict[...]], Tuple[...], Optional[...], assign-tuple-literal, ...)
Positive patterns detected: TypeAlias, NamedTuple, @dataclass, pydantic.BaseModel
Exit codes: 0 = informational, 1 = usage error
Pre-track findings (baseline):
- 430 weak sites in 29 of 61 files
- 0 strong patterns
- 26 unique type strings
- Top 4 unique strings = 86% of findings
Post-track target:
- ~60 weak sites in the 23 lower-impact files (the 6 high-traffic files contribute 0)
- 10 `TypeAlias` definitions + 1 `NamedTuple` in use
- `--strict` mode + baseline file as permanent CI gate
This is the most measurable track in the planning session. Success = a concrete number drop in the audit count.
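For readers who have not opened the script, the detection idea (a regex over `ast.unparse()` of each annotation) and the `--strict` gate can be sketched like this. It is a simplified illustration that matches only the `dict[str, Any]` family and inspects only positional parameters, return types and annotated assignments; the function names and the baseline comparison are assumptions, not the script's actual interface.

```python
import ast
import re

# One weak shape out of the 14 the real script is described as detecting.
_WEAK = re.compile(r"\b[Dd]ict\[str, Any\]")


def find_weak_annotations(source: str) -> list[tuple[int, str]]:
    """Return sorted (line, annotation) pairs whose text contains dict[str, Any]."""
    findings: list[tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        annotations: list[ast.expr] = []
        if isinstance(node, ast.AnnAssign):
            annotations.append(node.annotation)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            annotations.extend(a.annotation for a in node.args.args if a.annotation)
            if node.returns:
                annotations.append(node.returns)
        for annotation in annotations:
            text = ast.unparse(annotation)
            if _WEAK.search(text):
                findings.append((annotation.lineno, text))
    return sorted(findings)


def strict_exit_code(current_count: int, baseline_count: int) -> int:
    """CI gate: exit 1 only when the weak-site count grows past the baseline."""
    return 1 if current_count > baseline_count else 0
```

Comparing against a stored baseline lets the gate tolerate the weak sites that remain in lower-impact files while still rejecting any new ones.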
6. The Coordinate Picture (dependencies)
The 5 tracks form a dependency graph. The arrows are "blocks":
startup_speedup_20260606 (SHIPPED)
↓
├── test_batching_refactor_20260606 (planned)
│
├── qwen_llama_grok_integration_20260606 (planned)
│ ↓
│ ├── data_oriented_error_handling_20260606 (planned)
│ │ ↓
│ │ ├── public_api_migration_20260606 (follow-up; not yet specced)
│ │ └── type_registry_ci_20260606 (follow-up; not yet specced)
│ │
│ └── data_structure_strengthening_20260606 (planned)
│ ↓
│ └── type_registry_ci_20260606 (follow-up; not yet specced)
│
└── mcp_architecture_refactor_20260606 (planned; depends on data_oriented + data_structure tracks)
↓
└── mcp_dsl_20260606 (follow-up; not yet specced)
Critical insight: mcp_architecture_refactor_20260606 depends on BOTH data_oriented_error_handling_20260606 (for Result) and data_structure_strengthening_20260606 (for the Metadata aliases). If the implementing agent executes tracks in arbitrary order, this dependency is broken.
The recommended execution order is the topological order: startup_speedup (done) → qwen_llama_grok → data_oriented_error_handling + data_structure_strengthening (in parallel) → mcp_architecture_refactor → test_batching_refactor (no dependencies; can run anytime) → follow-up tracks.
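The ordering can be checked mechanically with the standard library's `graphlib`. The sketch below encodes the "blocked by" edges as drawn in the graph above (track names shortened, follow-up tracks omitted); the `BLOCKED_BY` mapping and `execution_order` helper are illustrative, not part of the conductor tooling.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of tracks that must ship before it.
BLOCKED_BY: dict[str, set[str]] = {
    "startup_speedup": set(),
    "test_batching_refactor": {"startup_speedup"},
    "qwen_llama_grok": {"startup_speedup"},
    "data_oriented_error_handling": {"qwen_llama_grok"},
    "data_structure_strengthening": {"qwen_llama_grok"},
    "mcp_architecture_refactor": {
        "data_oriented_error_handling",
        "data_structure_strengthening",
    },
}


def execution_order(blocked_by: dict[str, set[str]]) -> list[str]:
    """Return one valid order; raises graphlib.CycleError if the graph has a cycle."""
    return list(TopologicalSorter(blocked_by).static_order())
```

Any order returned by `execution_order(BLOCKED_BY)` places `mcp_architecture_refactor` after both of its prerequisites; several orders are valid, which is why `test_batching_refactor` can run at any point after the shipped predecessor.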
7. Follow-up Tracks Already Planned (Not in This Session's 5)
Each track's spec §12.1 names a follow-up. Aggregated:
| Follow-up | Parent track | Scope |
|---|---|---|
| `public_api_migration_20260606` | data_oriented_error_handling | Remove deprecated `ai_client.send()`; migrate all callers (multi_agent_conductor, app_controller, ~50 tests) to `send_result()` |
| `type_registry_ci_20260606` | data_structure_strengthening | Wire `generate_type_registry.py --check` into CI; add pre-commit hook; document per-track commit workflow |
| `mcp_dsl_20260606` | mcp_architecture_refactor | Per-MCP compact dialect for tool calls (APL/K/Cosy-inspired); ~5x token reduction per call |
All three are listed in conductor/tracks.md as [ ] placeholders. They should be sequenced AFTER the 5 main tracks ship. None are urgent; all are improvements.
8. Recommended Future Tracks (Beyond What's Planned)
These are tracks I identified during this session but didn't fully spec. They're ranked by what I think is most important.
8.1 Post-Tracks Documentation Synchronization (top pick)
Why: The 5 planned tracks add 10+ new modules and change the architecture significantly. The existing docs (docs/guide_*.md) were last updated in the 2026-06-02 comprehensive docs refresh — and are about to be more out of date than they are now. Stale docs are the #1 enemy of AI readability (an LLM reading guide_ai_client.md and finding it pre-dates Result/ErrorInfo will hallucinate the wrong shape).
Scope (1-2 phases):
- Phase 1: Update all existing guides (`guide_ai_client.md`, `guide_mcp_client.md`, etc.) to reflect the post-tracks state.
- Phase 2: Add cookbooks ("How to add a new sub-MCP", "How to add a new AI vendor", "How to add a new result type") + a `docs/type_registry.md` index.
Why first: Bounded and achievable. Closes the loop on all the planning work — each track ships a module; this track ships the docs that explain those modules.
8.2 Test Coverage Audit & Improvement (runner-up)
Why: The project has a stated >80% coverage target per conductor/workflow.md, but the actual current state is unknown. Under-tested areas are likely app_controller.py (4,153 lines; the orchestrator that touches everything) and multi_agent_conductor.py (the most complex control flow). The new modules from the 5 planned tracks each get unit tests in their respective tracks, but integration tests are sparse.
Scope (1-2 phases):
- Phase 1: Run `pytest --cov=src --cov-report=html`; identify the bottom-10 modules by coverage; write tests to bring each to >80%.
- Phase 2: Add a coverage threshold to CI (e.g., `--cov-fail-under=80`); add per-module coverage badges to `docs/Readme.md`.
8.3 Security Audit / Hardening
Why: The 3-layer MCP security model is solid, but there are adjacent concerns:
- Command injection in `run_powershell`: the AI generates PowerShell commands; how is the risk of a malicious model call mitigated? The HITL dialog exists, but is it consistently applied?
- Prompt injection: the AI sees file content, web search results, Beads queries. A malicious file could inject instructions that the AI then follows. How is this sanitized?
- Sensitive data in logs: the `comms_log` records full API requests/responses. If a user includes an API key or password in a message, it ends up in the log. What's the redaction policy?
Scope (1-2 phases):
- Phase 1: Threat model the AI tool-calling surface; document the existing mitigations; identify gaps.
- Phase 2: Add log redaction for known secret patterns; add a "dangerous command" detector for `run_powershell`; add an "untrusted content" marker for content from external sources.
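As a starting point for the Phase 2 redaction idea, a regex pass over log text might look like the sketch below. The pattern list and the `redact` helper are purely illustrative assumptions; a real policy would need a reviewed set of secret formats and would have to be applied wherever `comms_log` entries are written.

```python
import re

# Illustrative pattern only: "<key name><separator><value>" on a single token.
_SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|password|token)\b(\s*[:=]\s*)(\S+)"),
]


def redact(text: str) -> str:
    """Replace the value part of recognized key/value secrets with a marker."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)
    return text
```

Keeping the key name and separator in the output preserves the log's readability while removing the sensitive value itself.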
8.4 Dependency Hygiene
Why: pyproject.toml has a long dep list. No track for:
- Version pinning strategy (caret vs tilde vs exact)
- Deprecation monitoring (track when a vendor SDK announces EOL)
- License audit (any GPL contamination?)
- CVE scanning
This is a "track for the person who maintains the project 6 months from now."
9. Risks & Open Questions (Cross-Track)
9.1 Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| The implementing agent executes tracks in the wrong order, breaking the dependency chain (especially for `mcp_architecture_refactor_20260606` which depends on the other two). | Medium | High (broken tests; confusing failures) | The recommended execution order in §6 is explicit. The plan files note the dependencies in their "blocked_by" sections. |
| The 5 tracks add 10+ new files but the `scripts/audit_main_thread_imports.py` doesn't catch a heavy import in one of the new modules. | Low | Medium (regresses the startup_speedup invariant) | Each new module's Phase 1 task includes an import-time check (`uv run python -c "import time; ..."`). |
| A future contributor adds a new `dict[str, Any]` after the data_structure_strengthening track; the audit `--strict` mode catches it, but they're confused about why. | Medium | Low (process friction) | The styleguide + the deprecation warning in `--strict` mode explain the rule. |
| The `mcp_client_legacy.py` shim becomes permanent and never gets removed. | Medium | Low (acceptable) | The `public_api_migration_20260606` follow-up (and any future MCP-API changes) is the natural place to remove the shim. |
| The DSL idea becomes a "we have to do it now" before the architecture track is done. | Low | Low | The DSL is explicitly out of scope. The sub-MCP architecture is compatible with a future DSL layer. |
9.2 Open questions for the next planning round
- Where do the implementation agents' session notes / handoffs go? Each track has `metadata.json` + `state.toml` for the planning side. There's no equivalent for the implementation side. (The `startup_speedup_20260606` track's recent commits `253e1798`, `88fc42bb`, `8c4791d0` suggest they do handoff via commit messages, but a structured format would be nice.)
- What happens when a track's implementation diverges from the plan? Per `conductor/workflow.md`, "implementation differs from spec" is handled by updating the spec. But the plan files don't have a clear "deviations" section. Consider adding one to future plans.
- How are plan review comments captured? The plan files are committed at `cf01870b` (and the others). But there's no `conductor/plan_reviews/` directory. If the implementing agent has questions or disagreements, where do they go?
10. File Index
For the implementing agent (and any future planner), here's the canonical file index.
10.1 Conductor convention files (the project-level structure)
| File | Purpose |
|---|---|
| `conductor/tracks.md` | Master track registry. Lists all tracks with their status (`[ ]` planned, `[~]` in progress, `[x]` done) and `[track-created: <sha>]` references. |
| `conductor/workflow.md` | The project's TDD + per-track commit + git note workflow. |
| `conductor/product-guidelines.md` | The project's design principles (1-space indent, 1 commit per task, type hints, etc.). |
| `conductor/product.md` | The project's product vision and use cases. |
| `conductor/tech-stack.md` | The project's tech stack. |
| `conductor/code_styleguides/python.md` | Language-specific style guide. |
| `conductor/code_styleguides/error_handling.md` | (created in data_oriented_error_handling) Data-Oriented Error Handling convention. |
| `conductor/code_styleguides/type_aliases.md` | (created in data_structure_strengthening) Type Aliases convention. |
10.2 The 5 new tracks (this session's planning output)
| Track | Spec SHA | Plan SHA | Files |
|---|---|---|---|
| `test_batching_refactor_20260606` | `b7a97374` | `f7b11f7f` | `spec.md`, `metadata.json`, `state.toml`, `plan.md` |
| `qwen_llama_grok_integration_20260606` | `7c1d597e` (track init), `97daaff2` (consistency) | `b17cbbde` | `spec.md`, `metadata.json`, `state.toml`, `plan.md` |
| `data_oriented_error_handling_20260606` | `494f68f9` (init), `cbc3b075` (track + tracks.md), `f7b11f7f` (plan) | `f7b11f7f` | `spec.md`, `metadata.json`, `state.toml`, `plan.md` |
| `data_structure_strengthening_20260606` | `ed42a97a` (init), `aba35f9f` (registry), `432c7895` (risk) | `91475781` | `spec.md`, `metadata.json`, `state.toml`, `plan.md` |
| `mcp_architecture_refactor_20260606` | `2720a894` (init), `dd137df7` (backfill) | `cf01870b` | `spec.md`, `metadata.json`, `state.toml`, `plan.md` |
10.3 The 5 new module families (what the tracks will create)
| Module family | Created by | Files |
|---|---|---|
| Test batching | `test_batching_refactor_20260606` | `scripts/{test_categorizer,test_batcher,pytest_collection_order}.py`, `scripts/run_tests_batched.py`, `tests/test_categories.toml` |
| Vendor capability matrix | `qwen_llama_grok_integration_20260606` | `src/{vendor_capabilities,openai_compatible,qwen_adapter}.py` |
| Result types | `data_oriented_error_handling_20260606` | `src/result_types.py` |
| Type aliases + registry | `data_structure_strengthening_20260606` | `src/type_aliases.py`, `scripts/generate_type_registry.py`, `docs/type_registry/` |
| Sub-MCPs | `mcp_architecture_refactor_20260606` | `src/mcp_<type>.py` (7 files), `src/mcp_client_security.py`, `src/mcp_client_legacy.py` |
10.4 The audit script (data-driven decisions)
| File | Purpose |
|---|---|
| `scripts/audit_weak_types.py` (committed `84fd9ac9`) | AST analyzer that found the 430 weak sites driving data_structure_strengthening. |
10.5 The startup_speedup predecessor
| Track | Status | Key outputs |
|---|---|---|
| `startup_speedup_20260606` | SHIPPED (commits `12cec6ae`, `bb2ac6c9`, `253e1798`, `88fc42bb`, `8c4791d0`) | `_io_pool` ThreadPoolExecutor; warmup mechanism; lazy SDK imports; `scripts/audit_main_thread_imports.py` CI gate |
This is the predecessor for all 5 tracks — the lazy-SDK-import convention means the new modules can use from src.openai_compatible import send_openai_compatible at the top without paying the SDK import cost on the main thread.
11. Closing Notes
11.1 What the user achieved in this session
In a single multi-hour planning session, the user:
- Approved 5 architectural refactor tracks end-to-end (brainstorming → spec → plan)
- Made 3 major design decisions with significant impact: (1) the `mcp_<type>.py` naming convention, (2) the "docs over TypedDict" tradeoff, (3) the deprecation-not-removal of the public `send()` API
- Brought in external inspiration: Ryan Fleury's data-oriented error handling, the user's friend's DSL idea
- Established a pattern for data-grounded planning: every spec is preceded by an audit (or an inventory) that drives the design decisions
11.2 What the implementing agent inherits
- 5 fully-specced + planned tracks, each with TDD task breakdown
- A clear execution order (topological sort of the dependency graph)
- ~25+ unit tests per track (pre-existing + new) that serve as regression coverage
- A permanent audit + CI gate (`scripts/audit_weak_types.py --strict`) for the type-alias convention
- Styleguides + product-guidelines + a new docs directory (`docs/type_registry/`) that serve as living documentation
11.3 What I would do differently if I could start over
- Earlier on the data-oriented framing: The user brought Fleury's article mid-session (for the error-handling track). It would have been useful to surface the data-oriented design philosophy in the FIRST track (test_batching_refactor) and apply it there. Going forward, this is a thread to weave into every track.
- The "richest context" claim is half-true: I have deep visibility into architecture and code quality concerns but little visibility into operational / production concerns (observability, telemetry, error rates in the field, user experience metrics). The recommended future tracks in §8 reflect this bias.
11.4 One last recommendation
The post-tracks documentation track (§8.1) is the single most important thing to do NEXT — after the 5 tracks ship, the docs are out of date. Plan it BEFORE the user starts working on the next big feature, so the codebase stays maintainable.