conductor: Create 3 MVP tracks with surgical specs from full codebase analysis

Three new tracks identified by analyzing product.md requirements against
actual codebase state using 1M-context Opus with all architecture docs loaded:

1. mma_pipeline_fix_20260301 (P0, blocker):
   - Diagnoses why Tier 3 worker output never reaches mma_streams in GUI
   - Identifies 4 root cause candidates: positional arg ordering, asyncio.Queue
     thread-safety violation, ai_client.reset_session() side effects, token
     stats stub returning empty dict
   - 2 phases, 6 tasks with exact line references

2. simulation_hardening_20260301 (P1, depends on pipeline fix):
   - Addresses 3 documented issues from robust_live_simulation session compression
   - Mock triggers wrong approval popup, popup state desync, approval ambiguity
   - 3 phases, 9 tasks including standalone mock test suite

3. context_token_viz_20260301 (P2):
   - Builds UI for product.md primary use case #2 'Context & Memory Management'
   - Backend already complete (get_history_bleed_stats, 140 lines)
   - Token budget bar, proportion breakdown, trimming preview, cache status
   - 3 phases, 10 tasks

Execution order: pipeline_fix -> simulation_hardening -> gui_ux (parallel w/ token_viz)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-01 09:58:34 -05:00
parent d93f650c3a
commit 0d2b6049d1
9 changed files with 194 additions and 0 deletions


@@ -0,0 +1,9 @@
{
"track_id": "context_token_viz_20260301",
"description": "Build UI for context window utilization, token breakdown, trimming preview, and cache status.",
"type": "feature",
"status": "new",
"priority": "P2",
"created_at": "2026-03-01T15:50:00Z",
"updated_at": "2026-03-01T15:50:00Z"
}


@@ -0,0 +1,23 @@
# Implementation Plan: Context & Token Visualization
Architecture reference: [docs/guide_architecture.md](../../docs/guide_architecture.md) — AI Client section
## Phase 1: Token Budget Display
- [ ] Task 1.1: Add a new method `_render_token_budget_panel(self)` in `gui_2.py`. Place it in the Provider panel area (after `_render_provider_panel`, gui_2.py:2485-2542), or as a new collapsible section within the provider panel. Call `ai_client.get_history_bleed_stats(self._last_stable_md)` — need to cache `self._last_stable_md` from the last `_do_generate()` call (gui_2.py:1408-1425, the `stable_md` return value). Store the result in `self._token_stats: dict = {}`, refreshed on each `_do_generate` call and on provider/model switch.
- [ ] Task 1.2: Render the utilization bar. Use `imgui.progress_bar(stats['utilization_pct'] / 100, ImVec2(-1, 0), f"{stats['utilization_pct']:.1f}%")`. Color-code via `imgui.push_style_color(imgui.Col_.plot_histogram, ...)`: green if <50%, yellow if 50-80%, red if >80%. Below the bar, show: `f"{stats['estimated_prompt_tokens']:,} / {stats['max_prompt_tokens']:,} tokens ({stats['headroom_tokens']:,} remaining)"`.
- [ ] Task 1.3: Render the proportion breakdown as a 3-row table: System (`system_tokens`), Tools (`tools_tokens`), History (`history_tokens`). Each row shows token count and percentage of total. Use `imgui.begin_table("token_breakdown", 3)` with columns: Component, Tokens, Pct.
- [ ] Task 1.4: Write tests verifying `_render_token_budget_panel` calls `get_history_bleed_stats` and handles the empty dict case (when no provider is configured).
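The color and label logic in Tasks 1.2-1.3 can be kept as pure functions, decoupled from imgui, so Task 1.4's tests don't need a GUI context. A minimal sketch — the helper names are illustrative, not existing `gui_2.py` code; only the `stats` keys come from `get_history_bleed_stats`:

```python
# Hypothetical helpers for the token budget bar (Tasks 1.2-1.3).
# Pure functions so they are unit-testable without an imgui frame.

def budget_bar_color(utilization_pct: float) -> tuple[float, float, float, float]:
    """RGBA for the progress bar: green < 50%, yellow 50-80%, red > 80%."""
    if utilization_pct < 50:
        return (0.2, 0.8, 0.2, 1.0)
    if utilization_pct <= 80:
        return (0.9, 0.8, 0.1, 1.0)
    return (0.9, 0.2, 0.2, 1.0)

def budget_bar_label(stats: dict) -> str:
    """Formats the line rendered below the bar, per Task 1.2."""
    return (f"{stats['estimated_prompt_tokens']:,} / "
            f"{stats['max_prompt_tokens']:,} tokens "
            f"({stats['headroom_tokens']:,} remaining)")
```

The render method would then only push the returned color via `imgui.push_style_color` and pass the label string to `imgui.progress_bar`.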
## Phase 2: Trimming Preview & Cache Status
- [ ] Task 2.1: When `stats.get('would_trim')` is True, render a warning: `imgui.text_colored(ImVec4(1,0.3,0,1), "WARNING: Next call will trim history")`. Below it, show `f"Trimmable turns: {stats['trimmable_turns']}"`. If `stats` contains per-message breakdown, render the first 3 trimmable messages with their role and token count in a compact list.
- [ ] Task 2.2: Add Gemini cache status display. Read `ai_client._gemini_cache` (check `is not None`), `ai_client._gemini_cache_created_at`, and `ai_client._GEMINI_CACHE_TTL`. If cache exists, show: `"Gemini Cache: ACTIVE | Age: {age_seconds}s / {ttl}s | Renews at: {ttl * 0.9:.0f}s"`. If not, show `"Gemini Cache: INACTIVE"`. Guard with `if ai_client._provider == "gemini":`.
- [ ] Task 2.3: Add Anthropic cache hint. When provider is `"anthropic"`, show: `"Anthropic: 4-breakpoint ephemeral caching (auto-managed)"` with the number of history turns and whether the latest response used cache reads (check last comms log entry for `cache_read_input_tokens`).
- [ ] Task 2.4: Write tests for trimming warning visibility and cache status display.
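The Gemini cache string from Task 2.2 is also testable as a pure function. A sketch — the attribute names (`_gemini_cache`, `_gemini_cache_created_at`, `_GEMINI_CACHE_TTL`) follow the task text, but this helper itself is hypothetical; an injectable `now` keeps it deterministic for Task 2.4's tests:

```python
import time

def gemini_cache_status(cache, created_at, ttl, now=None):
    """Builds the cache status line from Task 2.2 (helper is illustrative)."""
    if cache is None or created_at is None:
        return "Gemini Cache: INACTIVE"
    now = time.time() if now is None else now
    age = int(now - created_at)
    # Renewal threshold at 90% of TTL, as specified in Task 2.2.
    return (f"Gemini Cache: ACTIVE | Age: {age}s / {ttl:.0f}s | "
            f"Renews at: {ttl * 0.9:.0f}s")
```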
## Phase 3: Auto-Refresh & Integration
- [ ] Task 3.1: Hook `_token_stats` refresh into three trigger points: (a) after `_do_generate()` completes — cache `stable_md` and call `get_history_bleed_stats`; (b) after provider/model switch in `current_provider.setter` and `current_model.setter` — clear and re-fetch; (c) after each `handle_ai_response` in `_process_pending_gui_tasks` — refresh stats since history grew. For (c), use a flag `self._token_stats_dirty = True` and refresh in the next frame's render call to avoid calling the stats function too frequently.
- [ ] Task 3.2: Add the token budget panel to the Hook API. Extend `/api/gui/mma_status` (or add a new `/api/gui/token_stats` endpoint) to expose `_token_stats` for simulation verification. This allows tests to assert on token utilization levels.
- [ ] Task 3.3: Conductor - User Manual Verification 'Phase 3: Auto-Refresh & Integration' (Protocol in workflow.md)
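The dirty-flag pattern from Task 3.1(c) can be sketched as a small holder class: frequent events only set the flag, and the (potentially heavy) stats call runs at most once per render frame. The class and fetcher wiring are illustrative, not existing `gui_2.py` code:

```python
# Minimal sketch of the dirty-flag refresh from Task 3.1(c).
# Events mark the flag; the render loop refetches at most once per frame.

class TokenStatsRefresher:
    def __init__(self, fetch_stats):
        self._fetch = fetch_stats  # e.g. ai_client.get_history_bleed_stats
        self._stats: dict = {}
        self._dirty = True         # fetch on first frame

    def mark_dirty(self) -> None:
        """Call after handle_ai_response, provider switch, or _do_generate."""
        self._dirty = True

    def stats_for_frame(self) -> dict:
        """Call once per render frame; refetches only when dirty."""
        if self._dirty:
            self._stats = self._fetch() or {}
            self._dirty = False
        return self._stats
```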

View File

@@ -0,0 +1,42 @@
# Track Specification: Context & Token Visualization
## Overview
product.md lists "Context & Memory Management" as primary use case #2: "Better visualization and management of token usage and context memory, allowing developers to optimize prompt limits manually." The backend already computes everything needed via `ai_client.get_history_bleed_stats()` (ai_client.py:1657-1796, 140 lines). This track builds the UI to expose it.
## Current State
### Backend (already implemented)
`get_history_bleed_stats(md_content=None) -> dict[str, Any]` returns:
- `provider`: Active provider name
- `model`: Active model name
- `history_turns`: Number of conversation turns
- `estimated_prompt_tokens`: Total estimated prompt tokens (system + history + tools)
- `max_prompt_tokens`: Provider's max (180K Anthropic, 900K Gemini)
- `utilization_pct`: `estimated / max * 100`
- `headroom_tokens`: Tokens remaining before trimming kicks in
- `would_trim`: Boolean — whether the next call would trigger history trimming
- `trimmable_turns`: Number of turns that could be dropped
- `system_tokens`: Tokens consumed by system prompt + context
- `tools_tokens`: Tokens consumed by tool definitions
- `history_tokens`: Tokens consumed by conversation history
- Per-message breakdown with role, token estimate, and whether it contains tool use
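For concreteness, a return value might look like the dict below. The values are invented for illustration; only the keys and the composition relation (`estimated_prompt_tokens` = system + history + tools, per the field list above) come from the source:

```python
# Illustrative shape of the get_history_bleed_stats() return value.
# Values are made up; keys mirror the field list above.
example_stats = {
    "provider": "anthropic",
    "model": "example-model",        # placeholder, not a real model name
    "history_turns": 12,
    "estimated_prompt_tokens": 54_000,
    "max_prompt_tokens": 180_000,
    "utilization_pct": 30.0,         # estimated / max * 100
    "headroom_tokens": 126_000,
    "would_trim": False,
    "trimmable_turns": 0,
    "system_tokens": 9_000,
    "tools_tokens": 6_000,
    "history_tokens": 39_000,
}

# Consistency implied by the field definitions:
assert (example_stats["system_tokens"] + example_stats["tools_tokens"]
        + example_stats["history_tokens"]) == example_stats["estimated_prompt_tokens"]
assert example_stats["utilization_pct"] == (
    example_stats["estimated_prompt_tokens"]
    / example_stats["max_prompt_tokens"] * 100)
```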
### GUI (missing)
No UI exists to display any of this. The user has zero visibility into:
- How close they are to hitting the context window limit
- What proportion is system prompt vs history vs tools
- Which messages would be trimmed and when
- Whether Gemini's server-side cache is active and how large it is
## Goals
1. **Token Budget Bar**: A prominent progress bar showing context utilization (green < 50%, yellow 50-80%, red > 80%).
2. **Breakdown Panel**: Stacked bar or table showing system/tools/history proportions.
3. **Trimming Preview**: When `would_trim` is true, show which turns would be dropped.
4. **Cache Status**: For Gemini, show whether `_gemini_cache` exists, its size in tokens, and TTL remaining.
5. **Refresh**: Auto-refresh on provider/model switch and after each AI response.
## Architecture Reference
- AI client state: [docs/guide_architecture.md](../../docs/guide_architecture.md) — see "AI Client: Multi-Provider Architecture"
- Gemini cache: [docs/guide_architecture.md](../../docs/guide_architecture.md) — see "Gemini Cache Strategy"
- Anthropic cache: [docs/guide_architecture.md](../../docs/guide_architecture.md) — see "Anthropic Cache Strategy (4-Breakpoint System)"
- Frame-sync: [docs/guide_architecture.md](../../docs/guide_architecture.md) — see `_process_pending_gui_tasks` for how to safely read backend state from GUI thread


@@ -0,0 +1,10 @@
{
"track_id": "mma_pipeline_fix_20260301",
"description": "Fix Tier 3 worker responses not reaching mma_streams in GUI, fix token usage tracking stubs.",
"type": "fix",
"status": "new",
"priority": "P0",
"blocks": ["comprehensive_gui_ux_20260228", "simulation_hardening_20260301"],
"created_at": "2026-03-01T15:45:00Z",
"updated_at": "2026-03-01T15:45:00Z"
}


@@ -0,0 +1,18 @@
# Implementation Plan: MMA Pipeline Fix & Worker Stream Verification
## Phase 1: Diagnose & Fix Worker Stream Pipeline
- [ ] Task 1.1: Add diagnostic logging to `run_worker_lifecycle` (multi_agent_conductor.py:280-290). Before the `_queue_put` call, add `print(f"[MMA] Pushing Tier 3 response for {ticket.id}, loop={'present' if loop else 'NONE'}, stream_id={response_payload['stream_id']}")`. Also add a `print` inside the `except Exception as e` block that currently silently swallows errors. This will reveal whether (a) the function reaches the push point, (b) `loop` is passed correctly, (c) any exceptions are being swallowed.
- [ ] Task 1.2: Remove the unsafe `else` branch in `run_worker_lifecycle` (multi_agent_conductor.py:289-290) that calls `event_queue._queue.put_nowait()`. `asyncio.Queue` is NOT thread-safe from non-event-loop threads. The `else` branch should either raise an error (`raise RuntimeError("loop is required for thread-safe event queue access")`) or use a fallback that IS thread-safe. Same fix needed in `confirm_execution` (line 156) and `confirm_spawn` (line 183).
- [ ] Task 1.3: Verify the `run_in_executor` positional argument order at `multi_agent_conductor.py:118-127` matches `run_worker_lifecycle`'s signature exactly: `(ticket, context, context_files, event_queue, engine, md_content, loop)`. The signature at line 207 is: `(ticket, context, context_files=None, event_queue=None, engine=None, md_content="", loop=None)`. Positional args must be in this exact order. If any are swapped, fix the call site.
- [ ] Task 1.4: Write a unit test that creates a mock `AsyncEventQueue` and `asyncio.AbstractEventLoop`, calls `run_worker_lifecycle` with a mock `ai_client.send` (returning a fixed string), and verifies the `("response", {...})` event was pushed with the correct `stream_id` format `"Tier 3 (Worker): {ticket.id}"`.
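The core of the Task 1.4 test can be sketched without the real conductor: run an event loop in a background thread and push the `("response", ...)` event from the calling thread via `asyncio.run_coroutine_threadsafe` — the thread-safe path Task 1.2 mandates. Everything below is illustrative scaffolding, not existing project code:

```python
# Sketch of the Task 1.4 test idea: cross-thread event push using the
# thread-safe API, then assert on the stream_id format.
import asyncio
import threading

def test_worker_response_reaches_queue():
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()

    async def _make_queue():
        # Create the queue inside the loop so it binds to it on all versions.
        return asyncio.Queue()

    queue = asyncio.run_coroutine_threadsafe(_make_queue(), loop).result(timeout=5)

    ticket_id = "T-001"
    payload = {"stream_id": f"Tier 3 (Worker): {ticket_id}", "text": "done"}

    # The push run_worker_lifecycle should perform from its thread-pool thread:
    asyncio.run_coroutine_threadsafe(
        queue.put(("response", payload)), loop).result(timeout=5)

    kind, data = asyncio.run_coroutine_threadsafe(
        queue.get(), loop).result(timeout=5)
    loop.call_soon_threadsafe(loop.stop)

    assert kind == "response"
    assert data["stream_id"] == "Tier 3 (Worker): T-001"
```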
## Phase 2: Fix Token Usage Tracking
- [ ] Task 2.1: In `run_worker_lifecycle` (multi_agent_conductor.py:295-298), the `stats = {}` stub produces zero token counts. Replace with `stats = ai_client.get_history_bleed_stats()` which returns a dict containing `"total_input_tokens"` and `"total_output_tokens"` (see ai_client.py:1657-1796). Extract the relevant fields and update `engine.tier_usage["Tier 3"]`. If `get_history_bleed_stats` is too heavy, use the simpler approach: after `ai_client.send()`, read the last comms log entry from `ai_client.get_comms_log()[-1]` which contains `payload.usage` with token counts.
- [ ] Task 2.2: Similarly fix Tier 1 and Tier 2 token tracking. In `_cb_plan_epic` (gui_2.py:1985-2010) and wherever Tier 2 calls happen, ensure `mma_tier_usage` is updated with actual token counts from comms log entries.
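Task 2.1's lighter fallback (reading the last comms log entry) might look like the sketch below. The entry layout assumed here (`{"payload": {"usage": {...}}}`) and the usage key names are assumptions — adjust them to whatever `get_comms_log()` actually returns:

```python
# Hedged sketch of extracting token counts from the last comms log entry.
# Entry structure and key names are ASSUMED, not verified against ai_client.py.

def last_call_usage(comms_log: list) -> tuple:
    """Returns (input_tokens, output_tokens) for the most recent call,
    or (0, 0) when nothing is available."""
    if not comms_log:
        return (0, 0)
    usage = (comms_log[-1].get("payload") or {}).get("usage") or {}
    return (int(usage.get("input_tokens", 0)),
            int(usage.get("output_tokens", 0)))
```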
## Phase 3: End-to-End Verification
- [ ] Task 3.1: Update `tests/visual_sim_mma_v2.py` Stage 8 to assert that `mma_streams` contains a key matching `"Tier 3"` with non-empty content after a full mock MMA run. If this already passes with the fixes from Phase 1, mark as verified. If not, trace the specific failure point using the diagnostic logging from Task 1.1.
- [ ] Task 3.2: Conductor - User Manual Verification 'Phase 3: End-to-End Verification' (Protocol in workflow.md)


@@ -0,0 +1,26 @@
# Track Specification: MMA Pipeline Fix & Worker Stream Verification
## Overview
The MMA pipeline has a verified code path from `run_worker_lifecycle` -> `_queue_put("response", ...)` -> `_process_event_queue` -> `_pending_gui_tasks("handle_ai_response")` -> `mma_streams[stream_id] = text`. However, the robust_live_simulation track's session compression (2026-02-28) documented that Tier 3 worker output never appears in `mma_streams` during actual GUI operation. The simulation only ever sees `'Tier 1'` in `mma_streams` keys.
This track diagnoses and fixes the pipeline break, then verifies end-to-end that worker output flows from `ai_client.send()` through to the GUI's `mma_streams` dict.
## Root Cause Candidates (from code analysis)
1. **`run_in_executor` positional arg ordering**: `run_worker_lifecycle` has 7 parameters, and the call at `multi_agent_conductor.py:118-127` passes them positionally. If the order is wrong, `loop` could arrive as `None`, so `_queue_put` skips the `if loop:` branch and falls through to `event_queue._queue.put_nowait()`, which is unreliable from a thread-pool thread because `asyncio.Queue.put_nowait` is not thread-safe when called from outside the event loop.
2. **`asyncio.Queue` thread safety**: `_queue_put` uses `asyncio.run_coroutine_threadsafe()` which IS thread-safe. But the `else` branch (`event_queue._queue.put_nowait(...)`) is NOT — `asyncio.Queue` is NOT thread-safe for cross-thread access. If `loop` is `None`, this branch silently corrupts or drops the event.
3. **`ai_client.reset_session()` side effects**: Called at the start of `run_worker_lifecycle`, this resets the global `_gemini_cli_adapter.session_id = None`. If the adapter is shared state and the GUI's Tier 2 call is still in-flight, this could corrupt the provider state.
4. **Token stats stub**: `engine.tier_usage` update uses `stats = {}` (empty dict, commented "ai_client.get_token_stats() is not available"), so `prompt_tokens` and `candidates_tokens` are always 0. Not a stream bug but a data bug.
## Goals
1. Fix Tier 3 worker responses reaching `mma_streams` in the GUI.
2. Fix token usage tracking for Tier 3 workers.
3. Verify via `ApiHookClient.get_mma_status()` that `mma_streams` contains Tier 3 output after a mock MMA run.
## Architecture Reference
- Threading model: [docs/guide_architecture.md](../../docs/guide_architecture.md) — see "Cross-Thread Data Structures" and "Pattern A: AsyncEventQueue"
- Worker lifecycle: [docs/guide_mma.md](../../docs/guide_mma.md) — see "Tier 3: Worker Lifecycle"
- Frame-sync: [docs/guide_architecture.md](../../docs/guide_architecture.md) — see "Frame-Sync Mechanism" action catalog (`handle_ai_response` with `stream_id`)


@@ -0,0 +1,10 @@
{
"track_id": "simulation_hardening_20260301",
"description": "Stabilize visual_sim_mma_v2.py and mock_gemini_cli.py for reliable end-to-end MMA simulation.",
"type": "fix",
"status": "new",
"priority": "P1",
"depends_on": ["mma_pipeline_fix_20260301"],
"created_at": "2026-03-01T15:45:00Z",
"updated_at": "2026-03-01T15:45:00Z"
}


@@ -0,0 +1,22 @@
# Implementation Plan: Simulation Hardening
Depends on: `mma_pipeline_fix_20260301`
Architecture reference: [docs/guide_simulations.md](../../docs/guide_simulations.md)
## Phase 1: Mock Provider Cleanup
- [ ] Task 1.1: Rewrite `tests/mock_gemini_cli.py` response routing to be explicit about which prompts trigger tool calls vs plain text. Current default emits `read_file` tool calls which trigger `_pending_ask_dialog` (wrong approval type). Fix: only emit tool calls when the prompt contains `'"role": "tool"'` (already handled as the post-tool-call response path). The default path (Tier 3 worker prompts, epic planning, sprint planning) should return plain text only. Remove any remaining magic keyword matching that isn't necessary. Verify by checking that the mock's output for an epic planning prompt does NOT contain any `function_call` JSON.
- [ ] Task 1.2: Add a new response route to `mock_gemini_cli.py` for Tier 2 Tech Lead prompts. Detect via `'PATH: Sprint Planning'` or `'generate the implementation tickets'` in the prompt. Return a well-formed JSON array of 2-3 mock tickets with proper `depends_on` relationships. Ensure the JSON is parseable by `conductor_tech_lead.py`'s multi-layer extraction (test by feeding the mock output through `json.loads()`).
- [ ] Task 1.3: Write a standalone test (`tests/test_mock_gemini_cli.py`) that invokes the mock script via `subprocess.run()` with various stdin prompts and verifies: (a) epic prompt → Track JSON, no tool calls; (b) sprint prompt → Ticket JSON, no tool calls; (c) worker prompt → plain text, no tool calls; (d) tool-result prompt → plain text response.
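The explicit routing Task 1.1 asks for can be isolated in one pure function, so the mock's behavior is testable without `subprocess`. A sketch — the route names are illustrative, while the marker strings come from Tasks 1.1-1.2:

```python
# Sketch of explicit prompt routing for mock_gemini_cli.py (Tasks 1.1-1.2).
# Route names are illustrative; detection markers mirror the task text.

def route_prompt(prompt: str) -> str:
    if '"role": "tool"' in prompt:
        return "tool_result_followup"   # plain-text reply after a tool result
    if ("PATH: Sprint Planning" in prompt
            or "generate the implementation tickets" in prompt):
        return "tier2_tickets_json"     # well-formed JSON ticket array
    return "plain_text"                 # default path: NO tool calls emitted
```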
## Phase 2: Simulation Stability
- [ ] Task 2.1: In `tests/visual_sim_mma_v2.py`, add a `time.sleep(0.5)` after every `client.click()` call that triggers a state change (Accept, Load Track, Approve). This gives the GUI thread one frame to process `_pending_gui_tasks` before the next `get_mma_status()` poll. The current rapid-fire click-then-poll pattern races against the frame-sync mechanism.
- [ ] Task 2.2: Add explicit `client.wait_for_value()` calls after critical state transitions instead of raw polling loops. For example, after `client.click('btn_mma_accept_tracks')`, use `client.wait_for_value('proposed_tracks_count', 0, timeout=10)` (may need to add a `proposed_tracks_count` field to the `/api/gui/mma_status` response, or just poll until `proposed_tracks` is empty/absent).
- [ ] Task 2.3: Add a test timeout decorator or `pytest.mark.timeout(300)` to the main test function to prevent infinite hangs in CI. Currently the test can hang forever if any polling loop never satisfies its condition.
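If `ApiHookClient` does not yet expose `wait_for_value`, Task 2.2 implies a helper along these lines. The `get_value` accessor is hypothetical; the point is the bounded poll-with-timeout replacing raw polling loops:

```python
# Minimal sketch of the wait_for_value helper assumed by Task 2.2.
import time

def wait_for_value(get_value, key, expected, timeout=10.0, interval=0.25):
    """Polls get_value(key) until it equals expected; raises TimeoutError
    if the deadline passes, so no polling loop can hang forever."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_value(key) == expected:
            return
        time.sleep(interval)
    raise TimeoutError(f"{key} never reached {expected!r} within {timeout}s")
```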
## Phase 3: End-to-End Verification
- [ ] Task 3.1: Run the full `tests/visual_sim_mma_v2.py` against the live GUI with mock provider. All 8 stages must pass. Document any remaining failures with exact error output and polling state at time of failure.
- [ ] Task 3.2: Verify that after the full simulation run, `client.get_mma_status()` returns: (a) `mma_status` is `'done'` or tickets are all `'completed'`; (b) `mma_streams` contains at least one key with `'Tier 3'`; (c) `mma_tier_usage` shows non-zero values for at least Tier 3.
- [ ] Task 3.3: Conductor - User Manual Verification 'Phase 3: End-to-End Verification' (Protocol in workflow.md)


@@ -0,0 +1,34 @@
# Track Specification: Simulation Hardening
## Overview
The `robust_live_simulation_verification` track is marked complete but its session compression documents three unresolved issues: (1) brittle mock that triggers the wrong approval popup, (2) popup state desynchronization after "Accept" clicks, (3) Tier 3 output never appearing in `mma_streams` (fixed by `mma_pipeline_fix` track). This track stabilizes the simulation framework so it reliably passes end-to-end.
## Prerequisites
- `mma_pipeline_fix_20260301` MUST be completed first (fixes Tier 3 stream plumbing).
## Current Issues (from session compression 2026-02-28)
### Issue 1: Mock Triggers Wrong Approval Popup
`mock_gemini_cli.py` defaults to emitting a `read_file` tool call, which triggers the general tool approval popup (`_pending_ask_dialog`) instead of the MMA spawn popup (`_pending_mma_spawn`). The test expects the spawn popup and times out.
**Root cause**: The mock's default response path doesn't distinguish between MMA orchestration prompts and Tier 3 worker prompts. It needs to NOT emit tool calls for orchestration-level prompts (Tier 1/2), only for worker-level prompts where tool use is expected.
### Issue 2: Popup State Desynchronization
After clicking "Accept" on the track proposal modal, `_show_track_proposal_modal` is set to `False` but the test still sees the popup as active. The hook API's `mma_status` returns stale `proposed_tracks` data.
**Root cause**: `_cb_accept_tracks` (gui_2.py:2012-2045) processes tracks and clears `proposed_tracks`, but this runs on the GUI thread. The `ApiHookClient.get_mma_status()` reads via the GUI trampoline pattern, but there may be a frame delay before the state updates are visible.
### Issue 3: Approval Type Ambiguity
The test polling loop auto-approves `pending_approval` but can't distinguish between tool approval (`_pending_ask_dialog`), MMA step approval (`_pending_mma_approval`), and spawn approval (`_pending_mma_spawn`). The simulation needs explicit handling for each type.
**Already resolved in code**: `get_mma_status` now returns separate `pending_tool_approval`, `pending_mma_step_approval`, and `pending_mma_spawn_approval` booleans, and the test in `visual_sim_mma_v2.py` already checks these individually. The remaining work is making the mock not trigger unexpected approval types.
## Goals
1. Make `tests/visual_sim_mma_v2.py` pass reliably against the live GUI.
2. Clean up mock_gemini_cli.py to be deterministic and not trigger spurious approvals.
3. Add retry/timeout resilience to polling loops.
## Architecture Reference
- Simulation patterns: [docs/guide_simulations.md](../../docs/guide_simulations.md)
- Hook API endpoints: [docs/guide_tools.md](../../docs/guide_tools.md) — see `/api/gui/mma_status` response fields
- HITL mechanism: [docs/guide_architecture.md](../../docs/guide_architecture.md) — see "The Execution Clutch"