From 983538aa8b5b83fdc452cccfcdec537d454460b6 Mon Sep 17 00:00:00 2001 From: Ed_ Date: Thu, 5 Mar 2026 00:31:55 -0500 Subject: [PATCH] reports and potential new track --- TASKS.md | 6 +- .../index.md | 3 + .../metadata.json | 9 + .../plan.md | 33 + .../report.md | 2303 +++++++++++++++++ .../report_claude.md | 562 ++++ .../spec.md | 96 + 7 files changed, 3011 insertions(+), 1 deletion(-) create mode 100644 conductor/tracks/test_architecture_integrity_audit_20260304/index.md create mode 100644 conductor/tracks/test_architecture_integrity_audit_20260304/metadata.json create mode 100644 conductor/tracks/test_architecture_integrity_audit_20260304/plan.md create mode 100644 conductor/tracks/test_architecture_integrity_audit_20260304/report.md create mode 100644 conductor/tracks/test_architecture_integrity_audit_20260304/report_claude.md create mode 100644 conductor/tracks/test_architecture_integrity_audit_20260304/spec.md diff --git a/TASKS.md b/TASKS.md index 5cab1a6..7d40b2f 100644 --- a/TASKS.md +++ b/TASKS.md @@ -79,4 +79,8 @@ **Goal:** Elevate Tier 4 from a log summarizer to an auto-patcher. When a verification test fails, Tier 4 generates a `.patch` file. The GUI intercepts this and presents a side-by-side Diff Viewer. The user clicks "Apply Patch" to instantly resume the pipeline. ### 5. Transitioning to a Native Orchestrator -**Goal:** Absorb the Conductor extension entirely into the core application. Manual Slop should natively read/write `plan.md`, manage the `metadata.json`, and orchestrate the MMA tiers in pure Python, removing the dependency on external CLI shell executions (`mma_exec.py`). \ No newline at end of file +**Goal:** Absorb the Conductor extension entirely into the core application. Manual Slop should natively read/write `plan.md`, manage the `metadata.json`, and orchestrate the MMA tiers in pure Python, removing the dependency on external CLI shell executions (`mma_exec.py`). +### 10. 
test_architecture_integrity_audit_20260304 (Planned) +- **Status:** Initialized +- **Priority:** High +- **Goal:** Comprehensive audit of testing infrastructure and simulation framework to identify false positive risks, coverage gaps, and simulation fidelity issues. Documented by GLM-4.7 via full skeletal analysis of src/, tests/, and simulation/ directories. diff --git a/conductor/tracks/test_architecture_integrity_audit_20260304/index.md b/conductor/tracks/test_architecture_integrity_audit_20260304/index.md new file mode 100644 index 0000000..3612454 --- /dev/null +++ b/conductor/tracks/test_architecture_integrity_audit_20260304/index.md @@ -0,0 +1,3 @@ +# Test Architecture Integrity & Simulation Audit + +[Specification](spec.md) | [Plan](plan.md) diff --git a/conductor/tracks/test_architecture_integrity_audit_20260304/metadata.json b/conductor/tracks/test_architecture_integrity_audit_20260304/metadata.json new file mode 100644 index 0000000..d86b79b --- /dev/null +++ b/conductor/tracks/test_architecture_integrity_audit_20260304/metadata.json @@ -0,0 +1,9 @@ +{ + "id": "test_architecture_integrity_audit_20260304", + "name": "Test Architecture Integrity & Simulation Audit", + "status": "planned", + "created_at": "2026-03-04T00:00:00Z", + "updated_at": "2026-03-04T00:00:00Z", + "type": "audit", + "severity": "high" +} diff --git a/conductor/tracks/test_architecture_integrity_audit_20260304/plan.md b/conductor/tracks/test_architecture_integrity_audit_20260304/plan.md new file mode 100644 index 0000000..22707b0 --- /dev/null +++ b/conductor/tracks/test_architecture_integrity_audit_20260304/plan.md @@ -0,0 +1,33 @@ +# Implementation Plan + +## Phase 1: Documentation (Planning) +Focus: Create comprehensive audit documentation with severity ratings + +- [ ] Task 1.1: Document all identified false positive risks with severity matrix +- [ ] Task 1.2: Document all simulation fidelity gaps with impact analysis +- [ ] Task 1.3: Create mapping of coverage gaps to test
categories +- [ ] Task 1.4: Provide concrete false positive examples +- [ ] Task 1.5: Provide concrete simulation miss examples +- [ ] Task 1.6: Prioritize recommendations by impact/effort matrix + +## Phase 2: Review & Validation (Research) +Focus: Peer review of audit findings + +- [ ] Task 2.1: Review existing tracks for overlap with this audit +- [ ] Task 2.2: Validate severity ratings against actual bug history +- [ ] Task 2.3: Cross-reference findings with docs/guide_simulations.md contract +- [ ] Task 2.4: Identify which gaps should be addressed in which future track + +## Phase 3: Track Finalization +Focus: Prepare for downstream implementation tracks + +- [ ] Task 3.1: Create prioritized backlog of implementation recommendations +- [ ] Task 3.2: Map recommendations to appropriate future tracks +- [ ] Task 3.3: Document dependencies between this audit and subsequent work + +## Phase 4: User Manual Verification (Protocol in workflow.md) +Focus: Human review of audit findings + +- [ ] Task 4.1: Review severity matrix for accuracy +- [ ] Task 4.2: Validate concrete examples against real-world scenarios +- [ ] Task 4.3: Approve recommendations for implementation diff --git a/conductor/tracks/test_architecture_integrity_audit_20260304/report.md b/conductor/tracks/test_architecture_integrity_audit_20260304/report.md new file mode 100644 index 0000000..a45a3f6 --- /dev/null +++ b/conductor/tracks/test_architecture_integrity_audit_20260304/report.md @@ -0,0 +1,2303 @@ +# Manual Slop Testing & Simulation Architecture Analysis + +**Author:** GLM-4.7 + +**Analysis Date:** 2026-03-04 + +**Derivation Methodology:** +1. Performed full skeletal summary of all Python files in `./src/` (28 modules, ~400KB total) +2. Performed full skeletal summary of all Python test files in `./tests/` (100+ test files) +3. Performed full skeletal summary of all simulation files in `./simulation/` (9 scripts) +4. 
Read all architecture documentation in `./docs/` (guide_simulations.md, guide_mma.md, guide_architecture.md) 5. Analyzed test infrastructure patterns against structural testing contract requirements 6. Identified gaps between contract requirements and actual implementation 7. Mapped potential false positive scenarios across mock provider, auto-approval, and assertion patterns 8. Evaluated simulation framework fidelity against real UX requirements 9. Cross-referenced findings with existing tracks in `conductor/tracks/` + +--- + +## Executive Summary + +**Critical Finding:** The test suite has significant **false positive risks** and **simulation gaps** that could mask bugs and UX issues. The current approach is a **low-resolution emulator** rather than a high-fidelity puppet of real user experience. + +**Key Issues Identified:** +1. Mock provider always returns success → tests pass even if real LLM would fail +2. Auto-approval of all HITL gates → tests never verify approval UX works +3. Substring-based assertions → tests pass even if output is malformed +4. No state validation → tests check existence, not correctness +5. No negative path testing → error handling never verified +6. No visual verification → UI rendering bugs never caught + +--- + +## Part 1: Module Architecture Analysis (src/) +### Complete File Inventory and Functionality + +**Core Infrastructure (3 modules, ~23KB)** + +#### 1. 
events.py (2.6KB) +**Purpose:** Decoupled event system for cross-module communication + +**Classes:** +- `EventEmitter`: Synchronous pub/sub pattern + - `on(event_name, callback)`: Register callback for event + - `emit(event_name, *args, **kwargs)`: Execute all callbacks + - No thread safety - assumes single-threaded usage + +- `AsyncEventQueue`: Async queue-based communication + - `put(event_name, payload)`: Enqueue event + - `get()`: Retrieve event tuple (event_name, payload) + - Uses `asyncio.Queue` internally + +- `UserRequestEvent`: Typed payload for AI requests + - Fields: prompt, stable_md, file_items, disc_text, base_dir + - `to_dict()`: Serialize to dictionary format + +**Usage Pattern:** Used by ai_client for lifecycle hooks, by App for event routing, by multi_agent_conductor for state broadcasting. + +#### 2. models.py (6.9KB) +**Purpose:** Core data structures for MMA orchestration + +**Data Classes:** +- `Ticket`: Atomic unit of work + - Fields: id, description, status (todo|in_progress|completed|blocked), assigned_to, target_file, context_requirements, depends_on, blocked_reason, step_mode, retry_count + - Methods: `mark_blocked(reason)`, `mark_complete()`, `get(key, default)`, `to_dict()`, `from_dict(data)` + +- `Track`: Collection of tickets with shared goal + - Fields: id, description, tickets list + - Methods: `get_executable_tickets()` - returns todo tickets with all deps completed + +- `WorkerContext`: Context for Tier 3 agents + - Fields: ticket_id, model_name, messages list + +- `TrackState`: Persistence schema for track state + - Fields: metadata (Metadata object), discussion list, tasks list (Ticket objects) + - Methods: `to_dict()`, `from_dict(data)` + +- `Metadata`: Track metadata + - Fields: id, name, status, created_at, updated_at + - Methods: `to_dict()`, `from_dict(data)` + +**Constants:** +- `DISC_ROLES`: ["User", "AI", "Vendor API", "System", "Reasoning"] +- `AGENT_TOOL_NAMES`: List of 26 MCP tool names +- `CONFIG_PATH`: Path to 
config.toml + +**Usage Pattern:** Used throughout MMA system for state management and persistence. + +#### 3. api_hook_client.py (9.2KB) +**Purpose:** IPC client for hook server communication + +**Class: `ApiHookClient`** +- `__init__(base_url, max_retries, retry_delay)`: Initialize client +- `wait_for_server(timeout)`: Poll /status until ready +- `_make_request(method, endpoint, data, timeout)`: Internal request wrapper with retry logic +- `get_status()`: Check health of hook server +- `get_project()`: Retrieve project data +- `post_project(project_data)`: Update project data +- `get_session()`: Retrieve session data +- `get_mma_status()`: Retrieve current MMA status (track, tickets, tier, streams) +- `push_event(event_type, payload)`: Push event to GUI's AsyncEventQueue +- `get_performance()`: Retrieve UI performance metrics +- `post_session(session_entries)`: Update session data +- `post_gui(gui_data)`: Update GUI state +- `select_tab(tab_bar, tab)`: Switch to specific tab +- `select_list_item(listbox, item_value)`: Select item in listbox +- `set_value(item, value)`: Set GUI field value +- `get_value(item)`: Get GUI field value +- `get_text_value(item_tag)`: Get string representation of field +- `get_node_status(node_tag)`: Get DAG node status +- `click(item, *args, **kwargs)`: Simulate button click +- `get_indicator_state(tag)`: Check indicator visibility +- `get_events()`: Fetch and clear event queue +- `wait_for_event(event_type, timeout)`: Poll for specific event +- `wait_for_value(item, expected, timeout)`: Poll until value matches +- `reset_session()`: Simulate clicking Reset Session +- `request_confirmation(tool_name, args)`: Blocking approval request + +**Usage Pattern:** Primary interface for test automation via Hook API. + +#### 4. 
api_hooks.py (14.9KB) +**Purpose:** HTTP server for exposing internal state to external automation + +**Classes:** +- `HookServerInstance`: Custom ThreadingHTTPServer carrying App reference + - Manages server lifecycle + - Delegates to HookHandler for request processing + +- `HookHandler(BaseHTTPRequestHandler)`: Handles HTTP requests + - `do_GET()`: Handle GET requests + - `/status`: Return server health and basic state + - `/project`: Return project configuration + - `/session`: Return session data + - `/gui/state`: Return full GUI state (for test automation) + - `/diagnostics`: Return performance metrics + - `do_POST()`: Handle POST requests + - `/project`: Update project configuration + - `/session`: Update session data + - `/api/ask`: Trigger AI request with approval + - `/gui`: Execute GUI actions (set_value, click, select_tab, etc.) + - `/resolve_pending_action`: Resolve approval dialogs + +**Usage Pattern:** Provides REST API for headless mode and test automation. + +#### 5. performance_monitor.py (4.2KB) +**Purpose:** Telemetry for UI performance monitoring + +**Class: `PerformanceMonitor`** +- `__init__()`: Initialize metrics and start CPU monitoring thread +- `_monitor_cpu()`: Background thread sampling CPU usage every 1s +- `start_frame()`: Record frame start time +- `record_input_event()`: Track time since last input event +- `start_component(name)`: Start timing a UI component +- `end_component(name)`: End timing a UI component +- `end_frame()`: Calculate FPS and frame time +- `_check_alerts()`: Check for performance degradation (low FPS, high frame time, high input lag) +- `get_metrics()`: Return dict with fps, frame_time_ms, cpu_percent, input_lag_ms +- `stop()`: Stop monitoring thread + +**Usage Pattern:** Integrated into App for real-time performance tracking. + +### AI Integration Layer (2 modules, ~75KB) + +#### 6. 
ai_client.py (70.6KB) +**Purpose:** Multi-provider LLM abstraction + +**Module-Level State:** +- `_provider`: "gemini" | "anthropic" | "deepseek" | "gemini_cli" +- `_model`: Current model name +- `_temperature`, `_max_tokens`: Model parameters +- `_history_trunc_limit`: Character limit for history truncation (8000) +- `events`: EventEmitter for lifecycle hooks +- `_send_lock`: threading.Lock to serialize send() calls +- `MAX_TOOL_ROUNDS`: Maximum tool call loop iterations (10) +- `_MAX_TOOL_OUTPUT_BYTES`: Cumulative tool output budget (500KB) +- `_ANTHROPIC_CHUNK_SIZE`: Max chars per text block (120,000) +- `_ANTHROPIC_MAX_PROMPT_TOKENS`: Anthropic limit (180,000) +- `_GEMINI_MAX_INPUT_TOKENS`: Gemini limit (900,000) +- `_GEMINI_CACHE_TTL`: Cache TTL in seconds (3600, rebuilt at 90%) + +**Per-Provider Clients:** +- `_gemini_client`: genai.Client (SDK-managed stateful chat) +- `_gemini_chat`: Holds history internally +- `_gemini_cache`: Server-side CachedContent +- `_gemini_cache_md_hash`: Hash for cache invalidation +- `_gemini_cache_created_at`: Cache creation timestamp +- `_anthropic_client`: anthropic.Anthropic (client-managed history) +- `_anthropic_history`: List of message dicts (client-managed) +- `_anthropic_history_lock`: threading.Lock +- `_deepseek_client`: Raw requests HTTP client +- `_deepseek_history`: List of message dicts (client-managed) +- `_deepseek_history_lock`: threading.Lock +- `_gemini_cli_adapter`: GeminiCliAdapter (subprocess wrapper) + +**Callback Injections:** +- `confirm_and_run_callback`: Set by gui.py/app_controller.py - called when AI wants to run command +- `comms_log_callback`: Set by gui.py/app_controller.py - called when comms entry appended +- `tool_log_callback`: Set by gui.py/app_controller.py - called when tool call completes +- `current_tier`: Set by caller tiers - used for comms tagging + +**Key Functions:** +- `set_model_params(temp, max_tok, trunc_limit)`: Update model parameters +- `get_history_trunc_limit()`, 
`set_history_trunc_limit(val)`: Get/set trunc limit +- `cleanup()`: Destroy all provider clients +- `reset_session()`: Reset all provider state +- `get_gemini_cache_stats()`: Return cache stats dict +- `list_models(provider)`: List available models for provider +- `set_provider(provider, model)`: Switch provider and model +- `get_provider()`: Return current provider +- `set_agent_tools(tools)`: Set enabled tools +- `_build_anthropic_tools()`: Build tool schemas for Anthropic +- `_get_anthropic_tools()`: Get cached Anthropic tools +- `_gemini_tool_declaration()`: Build tool declaration for Gemini +- `_run_script(script, base_dir, qa_callback)`: Execute PowerShell via shell_runner +- `_truncate_tool_output(output)`: Truncate tool output at char limit +- `_reread_file_items(file_items)`: Check mtimes and rebuild changed file diffs +- `_build_file_context_text(file_items)`: Build context text from file items +- `_build_file_diff_text(changed_items)`: Build diff text for changed files +- `_content_block_to_dict(block)`: Convert content block to dict +- `_ensure_gemini_client()`: Initialize Gemini client if needed +- `_get_gemini_history_list(chat)`: Extract history list from chat +- `_send_gemini(md_content, user_message, base_dir, file_items, discussion_history, pre_tool_callback, qa_callback, enable_tools, stream_callback)`: Send via Gemini SDK +- `_send_gemini_cli(md_content, user_message, base_dir, file_items, discussion_history, pre_tool_callback, qa_callback, enable_tools, stream_callback)`: Send via Gemini CLI adapter +- `_send_anthropic(md_content, user_message, base_dir, file_items, discussion_history, pre_tool_callback, qa_callback, enable_tools, stream_callback)`: Send via Anthropic SDK +- `_estimate_message_tokens(msg)`: Estimate token count for message +- `_invalidate_token_estimate(msg)`: Clear token estimate cache +- `_estimate_prompt_tokens(system_blocks, history)`: Estimate total prompt tokens +- `_strip_stale_file_refreshes(history)`: Remove old [FILES 
UPDATED] blocks +- `_trim_anthropic_history(system_blocks, history)`: Trim Anthropic history at 180k limit +- `_ensure_anthropic_client()`: Initialize Anthropic client if needed +- `_chunk_text(text, chunk_size)`: Split text into chunks +- `_build_chunked_context_blocks(md_content)`: Build chunked context for Anthropic +- `_strip_cache_controls(history)`: Remove cache control headers +- `_add_history_cache_breakpoint(history)`: Add cache breakpoint +- `_repair_anthropic_history(history)`: Repair malformed Anthropic history +- `_send_deepseek(md_content, user_message, base_dir, file_items, discussion_history, pre_tool_callback, qa_callback, enable_tools, stream_callback)`: Send via DeepSeek HTTP +- `run_tier4_analysis(stderr)`: Call Tier 4 QA agent with error +- `get_token_stats(md_content)`: Return token usage statistics +- `send(md_content, user_message, base_dir, file_items, discussion_history, stream, pre_tool_callback, qa_callback, enable_tools, stream_callback)`: Main dispatcher +- `_add_bleed_derived(d, sys_tok, tool_tok)`: Add derived stats to comms entry +- `get_history_bleed_stats(md_content)`: Return history bleed statistics + +**Usage Pattern:** Unified interface for all LLM providers, handles tool calling, streaming, history management, and caching. + +#### 7. 
gemini_cli_adapter.py (4.6KB) +**Purpose:** Subprocess bridge for Gemini CLI + +**Class: `GeminiCliAdapter`** +- `__init__(binary_path)`: Initialize with path to gemini binary +- `send(message, safety_settings, system_instruction, model, stream_callback)`: + - Send message to CLI via stdin + - Parse streaming JSON output line-by-line + - Handle tool_use events (continue reading until final result) + - Extract usage metadata from result event + - Call stream_callback for each message chunk +- `count_tokens(contents)`: Estimate tokens (4 chars/token) +- `__enter__`, `__exit__`: Context manager interface (not implemented) + +**Usage Pattern:** Bridges GUI to Gemini CLI subprocess for headless provider support. +### MCP Tools Bridge (1 module, ~48KB) + +#### 8. mcp_client.py (48.2KB) +**Purpose:** MCP-style file context tools with security restrictions + +**Module-Level State:** +- `_allowed_paths`: Set of resolved absolute Path objects (files or dirs) +- `_base_dirs`: Set of resolved absolute Path dirs that act as roots +- `_primary_base_dir`: Path to primary base dir +- `MUTATING_TOOLS`: frozenset of mutating tool names +- `perf_monitor_callback`: Optional callback for performance metrics +- `MCP_TOOL_SPECS`: List of 26 tool specification dicts + +**Module-Level Functions:** +- `configure(file_items, extra_base_dirs)`: Build allowlist from file_items +- `_is_allowed(path)`: Check if path is within allowlist +- `_resolve_and_check(raw_path)`: Resolve path and verify it passes allowlist +- `read_file(path)`: Read file content +- `list_directory(path)`: List directory entries +- `search_files(path, pattern)`: Search files by glob pattern +- `get_file_summary(path)`: Return heuristic summary via summarize.py +- `py_get_skeleton(path)`: Get AST skeleton of Python file +- `py_get_code_outline(path)`: Get code outline with line ranges +- `get_file_slice(path, start_line, end_line)`: Read specific line range +- `set_file_slice(path, start_line, end_line, new_content)`: Replace 
line range +- `edit_file(path, old_string, new_string, replace_all)`: String replacement +- `_get_symbol_node(tree, name)`: Find AST node by name +- `py_get_definition(path, name)`: Get full source of class/function +- `py_update_definition(path, name, new_content)`: Update definition via AST +- `py_get_signature(path, name)`: Get function/method signature +- `py_set_signature(path, name, new_signature)`: Replace signature +- `py_get_class_summary(path, name)`: Get class method signatures +- `py_get_var_declaration(path, name)`: Get variable declaration +- `py_set_var_declaration(path, name, new_declaration)`: Replace variable declaration +- `get_git_diff(path, base_rev, head_rev)`: Get git diff +- `py_find_usages(path, name)`: Find symbol usages +- `py_get_imports(path)`: Get file imports +- `py_check_syntax(path)`: Check Python syntax +- `py_get_hierarchy(path, class_name)`: Find subclasses +- `py_get_docstring(path, name)`: Get docstring +- `get_tree(path, max_depth)`: Get directory tree +- `web_search(query)`: Search via DuckDuckGo +- `fetch_url(url)`: Fetch URL content +- `get_ui_performance()`: Get UI performance metrics +- `dispatch(tool_name, tool_input)`: Dispatch tool call to implementation + +**Read-Only Tools:** +- run_powershell (via shell_runner.py) +- read_file, list_directory, search_files, get_file_summary +- py_get_skeleton, py_get_code_outline, get_file_slice +- py_get_definition, py_get_signature, py_get_class_summary +- py_get_var_declaration, get_git_diff, py_find_usages +- py_get_imports, py_check_syntax, py_get_hierarchy, py_get_docstring +- get_tree, web_search, fetch_url, get_ui_performance + +**Mutating Tools:** +- set_file_slice, py_update_definition, py_set_signature, py_set_var_declaration, edit_file + +**Security:** +- All paths resolved to absolute paths +- Allowlist built from file_items + base_dirs +- Blacklist: history files never allowed +- MUTATING_TOOLS tracked for HITL enforcement + +**Usage Pattern:** AI calls these tools via 
function dispatch; security enforced at dispatch layer. + +### MMA Orchestration (4 modules, ~26KB) + +#### 9. multi_agent_conductor.py (13.7KB) +**Purpose:** 4-tier orchestration engine + +**Class: `ConductorEngine`** +- `__init__(track, event_queue, auto_queue)`: Initialize engine +- `_push_state(status, active_tier)`: Push MMA state update to GUI +- `parse_json_tickets(json_str)`: Parse JSON tickets into Ticket objects +- `run(md_content)`: Main execution loop + - While True: + - ready_tasks = self.engine.tick() + - If no ready_tasks: check if all done or blocked + - If any in_progress: await asyncio.sleep(1) # Waiting for async workers + - else: await self._push_state("blocked") # No executable tasks + - For each ready task: + - If in_progress or (auto_queue and not step_mode): + - Mark in_progress, spawn worker + - Else if todo: wait for HITL approval + +**Module Functions:** +- `_queue_put(event_queue, loop, event_name, payload)`: Thread-safe queue push +- `confirm_execution(payload, event_queue, ticket_id, loop)`: Push HITL approval for execution +- `confirm_spawn(role, prompt, context_md, event_queue, ticket_id, loop)`: Push HITL approval for spawn +- `run_worker_lifecycle(ticket, context, context_files, event_queue, engine, md_content, loop)`: Execute single ticket + - Reset ai_client session (Context Amnesia) + - Build context with AST skeletons + - Call ai_client.send() with tools + - Handle blocking, completion, errors + - Update ticket status + - Push state updates + +**Usage Pattern:** Orchestrates track execution through DAG-driven worker lifecycle. + +#### 10. 
conductor_tech_lead.py (2.8KB) +**Purpose:** Tier 2 ticket generation + +**Functions:** +- `generate_tickets(track_brief, module_skeletons)`: + - Set tier2_sprint_planning system prompt + - Call ai_client.send() with track brief + skeletons + - Extract JSON tickets from response (defensive parsing) + - Return list of ticket dicts +- `topological_sort(tickets)`: + - Convert dicts to Ticket objects + - Build TrackDAG + - Call dag.topological_sort() + - Return ordered ticket dicts + - Raise ValueError on cycle or missing dependency + +**Usage Pattern:** Converts track brief into executable ticket DAG. + +#### 11. orchestrator_pm.py (4.2KB) +**Purpose:** Tier 1 strategic planning + +**Constants:** +- `CONDUCTOR_PATH`: Path("conductor") + +**Functions:** +- `get_track_history_summary()`: Scan conductor/archive/ and conductor/tracks/ + - Read metadata.json from all tracks + - Build summary markdown string +- `generate_tracks(user_request, project_config, file_items, history_summary)`: + - Set tier1_epic_init system prompt + - Call ai_client.send() with user request + context + - Extract JSON tracks from response + - Return list of track dicts + +**Usage Pattern:** Breaks epic into implementation tracks. + +#### 12. mma_prompts.py (5.4KB) +**Purpose:** System prompt templates for hierarchical orchestration + +**Prompt Dictionary:** +- `TIER1_BASE_SYSTEM`: Tier 1 base system prompt +- `TIER1_EPIC_INIT`: Epic initialization prompt +- `TIER1_TRACK_DELEGATION`: Track delegation prompt +- `TIER1_MACRO_MERGE`: Macro-merge and acceptance prompt +- `TIER2_BASE_SYSTEM`: Tier 2 base system prompt +- `TIER2_SPRINT_PLANNING`: Sprint planning prompt +- `TIER2_CODE_REVIEW`: Code review prompt +- `TIER2_TRACK_FINALIZATION`: Track finalization prompt +- `TIER2_CONTRACT_FIRST`: Contract-first delegation prompt + +**Usage Pattern:** Provides structured, constraint-focused prompts for each tier. + +#### 13. 
dag_engine.py (5.4KB) +**Purpose:** Dependency graph and execution engine + +**Class: `TrackDAG`** +- `__init__(tickets)`: Initialize DAG with ticket list +- `ticket_map`: O(1) lookup by ID +- `cascade_blocks()`: Transitively mark todo tickets as blocked if any dependency is blocked +- `get_ready_tasks()`: Return tickets where status==todo and all deps are completed +- `has_cycle()`: DFS cycle detection (returns bool) + +**Class: `ExecutionEngine`** +- `__init__(dag, auto_queue)`: Initialize engine +- `tick()`: Evaluate DAG and return ready tasks + - If auto_queue: auto-promote non-step_mode tasks to in_progress +- `approve_task(task_id)`: Manually transition todo to in_progress +- `update_task_status(task_id, status)`: Force-update status + +**Usage Pattern:** Manages ticket execution with dependency resolution and auto-queue vs step_mode. +### Project & Context Management (5 modules, ~40KB) + +#### 14. project_manager.py (13.2KB) +**Purpose:** TOML config and history persistence + +**Constants:** +- `TS_FMT`: Timestamp format string +- `CONFIG_PATH`: Path to global config + +**Functions:** +- `now_ts()`: Return formatted timestamp +- `parse_ts(s)`: Parse timestamp string to datetime +- `entry_to_str(entry)`: Serialize dict to TOML string +- `str_to_entry(raw, roles)`: Parse TOML string to dict +- `get_git_commit(git_dir)`: Get current commit hash +- `get_git_log(git_dir, n)`: Get last n git log entries +- `default_discussion()`: Return empty discussion dict +- `default_project(name)`: Return default project dict +- `get_history_path(project_path)`: Return path to history TOML +- `load_project(path)`: Load project TOML (with legacy migration) +- `load_history(project_path)`: Load segregated history file +- `clean_nones(data)`: Recursively remove None values +- `save_project(proj, path, disc_data)`: Save project (segregates discussion) +- `migrate_from_legacy_config(cfg)`: Migrate legacy config to new format +- `flat_config(proj, disc_name, track_id)`: Return flat 
config dict +- `save_track_state(track_id, state, base_dir)`: Save TrackState to TOML +- `load_track_state(track_id, base_dir)`: Load TrackState from TOML +- `load_track_history(track_id, base_dir)`: Load track history from state +- `save_track_history(track_id, history, base_dir)`: Save track history to state +- `get_all_tracks(base_dir)`: Scan tracks directory and return track metadata list + +**Usage Pattern:** Manages all project and track persistence operations. + +#### 15. aggregate.py (14.2KB) +**Purpose:** Context construction for AI + +**Functions:** +- `find_next_increment(output_dir, namespace)`: Find next filename increment +- `is_absolute_with_drive(entry)`: Check if path is absolute with drive letter +- `resolve_paths(base_dir, entry)`: Resolve glob patterns to absolute paths +- `build_discussion_section(history)`: Build discussion history markdown +- `build_files_section(base_dir, files)`: Build files section markdown +- `build_screenshots_section(base_dir, screenshots)`: Build screenshots section markdown +- `build_file_items(base_dir, files)`: Read all files and build item dicts +- `build_summary_section(base_dir, files)`: Build compact summary via summarize.py +- `_build_files_section_from_items(file_items)`: Build files markdown from pre-read items +- `build_markdown_from_items(file_items, screenshot_base_dir, screenshots, history, summary_only)`: Build full markdown +- `build_markdown_no_history(file_items, screenshot_base_dir, screenshots, summary_only)`: Build markdown without history +- `build_discussion_text(history)`: Build discussion text only +- `build_tier1_context(file_items, screenshot_base_dir, screenshots, history)`: Tier 1 context (strategic, full conductor files) +- `build_tier2_context(file_items, screenshot_base_dir, screenshots, history)`: Tier 2 context (architectural, full files) +- `build_tier3_context(file_items, screenshot_base_dir, screenshots, history, focus_files)`: Tier 3 context (execution, focus files full, others 
skeletal) +- `build_markdown(base_dir, files, screenshot_base_dir, screenshots, history, summary_only)`: Main context builder +- `run(config)`: Main entry point - loads project and builds markdown + +**Usage Pattern:** Constructs tier-specific context for different MMA tiers. + +#### 16. file_cache.py (5.2KB) +**Purpose:** AST parsing for code views + +**Class: `ASTParser`** +- `__init__(language)`: Initialize parser with tree-sitter +- `parse(code)`: Parse code to tree-sitter Tree +- `get_skeleton(code)`: + - Parse code to AST + - Walk function_definition nodes + - Replace function bodies with "..." (preserve docstrings) +- `get_curated_view(code)`: + - Same as get_skeleton + - BUT: preserve bodies with @core_logic decorator or # [HOT] comments + +**Usage Pattern:** Generate compact code views for worker context injection. + +#### 17. summarize.py (6.5KB) +**Purpose:** Heuristic file summarizer (no AI calls) + +**Summarizer Dictionary:** +- `_SUMMARISERS`: dict mapping extensions to summarizer functions + +**Summarizer Functions:** +- `_summarise_python(path, content)`: + - Parse with ast + - Extract imports (deduplicated module names) + - Extract ALL_CAPS constants + - Extract classes with method names + - Extract top-level function names + - Output: "**Python — N lines**\nimports: ...\nconstants: ...\nclass ClassName: method1, method2\nfunctions: ..." 
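As a rough illustration of the `_summarise_python` heuristic above, here is a minimal self-contained sketch (the function name and exact output format approximate what the report describes; the real summarize.py may differ in detail):

```python
import ast

def summarise_python(content: str) -> str:
    """Minimal sketch of the heuristic summariser: extract imports,
    ALL_CAPS constants, classes with method names, and top-level
    functions -- purely structural, no AI calls."""
    tree = ast.parse(content)
    imports, constants, classes, functions = [], [], [], []
    for node in tree.body:
        if isinstance(node, ast.Import):
            # Keep only the top-level module name (e.g. "os" from "os.path")
            imports += [a.name.split(".")[0] for a in node.names]
        elif isinstance(node, ast.ImportFrom):
            imports.append((node.module or "").split(".")[0])
        elif isinstance(node, ast.Assign):
            for t in node.targets:
                if isinstance(t, ast.Name) and t.id.isupper():
                    constants.append(t.id)
        elif isinstance(node, ast.ClassDef):
            methods = [n.name for n in node.body
                       if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
            classes.append(f"class {node.name}: {', '.join(methods)}")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(node.name)
    lines = [f"**Python — {len(content.splitlines())} lines**"]
    if imports:
        # dict.fromkeys deduplicates while preserving first-seen order
        lines.append("imports: " + ", ".join(dict.fromkeys(imports)))
    if constants:
        lines.append("constants: " + ", ".join(constants))
    lines += classes
    if functions:
        lines.append("functions: " + ", ".join(functions))
    return "\n".join(lines)
```

Because it walks only `tree.body`, nested helpers and method bodies never appear, which is what keeps the summary token-efficient.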
+ +- `_summarise_toml(path, content)`: + - Parse TOML + - Extract top-level table keys + - Count array lengths + - Output table keys and sizes + +- `_summarise_markdown(path, content)`: + - Extract h1-h3 headings via regex + - Output heading hierarchy + +- `_summarise_generic(path, content)`: + - Count lines + - Output first 8 lines as preview + +**Module Functions:** +- `summarise_file(path, content)`: Dispatch to appropriate summarizer +- `summarise_items(file_items)`: Summarize all file items +- `build_summary_markdown(file_items)`: Build markdown summary string + +**Usage Pattern:** Generate token-efficient structural summaries. + +#### 18. outline_tool.py (1.9KB) +**Purpose:** Code structure mapper + +**Class: `CodeOutliner`** +- `__init__()`: Initialize +- `outline(code)`: + - Parse with ast + - Walk top-level nodes + - Extract ClassDef → "[Class] Name (Lines X-Y)" + - Extract FunctionDef → "[Func] Name (Lines X-Y)" + - Extract AsyncFunctionDef → "[Async Func] Name (Lines X-Y)" + - Extract method signatures from ClassDef bodies + - Output hierarchical outline + +**Usage Pattern:** Generate detailed code outlines. +### Logging & History (3 modules, ~17KB) + +#### 19. 
session_logger.py (5.9KB) +**Purpose:** Timestamped audit trails + +**Module-Level State:** +- `_LOG_DIR`: Path("./logs/sessions") +- `_SCRIPTS_DIR`: Path("./scripts/generated") +- `_ts`: Session timestamp string (YYYYMMDD_HHMMSS) +- `_session_id`: Session ID string +- `_session_dir`: Path to session subdirectory +- `_seq`: Monotonic counter for script files +- `_seq_lock`: threading.Lock for counter +- `_comms_fh`: File handle for comms.log +- `_tool_fh`: File handle for toolcalls.log +- `_api_fh`: File handle for apihooks.log +- `_cli_fh`: File handle for clicalls.log + +**Functions:** +- `_now_ts()`: Return formatted timestamp +- `open_session(label)`: + - Create session subdirectory + - Open log files (comms, toolcalls, apihooks, clicalls) +- `close_session()`: Flush and close all log files +- `log_api_hook(method, path, payload)`: Write to apihooks.log as JSON-L +- `log_comms(entry)`: Write to comms.log as JSON-L (thread-safe) +- `log_tool_call(script, result, script_path)`: + - Write to toolcalls.log + - Write PS1 script to scripts/generated/ + - Return script path +- `log_cli_call(command, stdin_content, stdout_content, stderr_content, latency)`: Write to clicalls.log as JSON-L + +**Usage Pattern:** All AI interactions logged for audit trails. + +#### 20. 
log_registry.py (9KB) +**Purpose:** Session metadata persistence + +**Class: `LogRegistry`** +- `__init__(registry_path)`: Initialize with registry path +- `_registry_data`: Dict for registry contents +- `_registry_path`: Path to registry TOML file + +**Methods:** +- `load_registry()`: Load TOML registry from disk +- `save_registry()`: Save registry to TOML disk +- `register_session(session_id, path, start_time)`: Add session to registry +- `update_session_metadata(session_id, message_count, errors, size_kb, whitelisted, reason)`: Update session metadata +- `is_session_whitelisted(session_id)`: Check whitelist status +- `update_auto_whitelist_status(session_id)`: + - Analyze session logs + - Auto-whitelist if errors present, high message count, or large size +- `get_old_non_whitelisted_sessions(cutoff_datetime)`: Return old non-whitelisted sessions + +**Usage Pattern:** Tracks all sessions with metadata for pruning. + +#### 21. log_pruner.py (2.2KB) +**Purpose:** Automated log cleanup + +**Class: `LogPruner`** +- `__init__(log_registry, logs_dir)`: Initialize with registry and logs dir + +**Methods:** +- `prune()`: + - Get old non-whitelisted sessions from registry + - For each session: check total size < 2KB (2048 bytes) + - If small and not whitelisted: delete session directory + +**Usage Pattern:** Automatically deletes insignificant old logs. + +### GUI Layer (2 modules, ~148KB) + +#### 22. 
gui_2.py (77.6KB) +**Purpose:** ImGui/Dear PyGui interface + +**Class: `App`** +- `__init__()`: Initialize application state + - Load config and project + - Start asyncio event loop thread + - Initialize performance monitor + - Initialize theme + - Setup GUI panels + - Register action handlers + - Load fonts +- `shutdown()`: Cleanly shutdown all services + +**Key Internal Methods:** +- `_handle_approve_tool(user_data)`: UI wrapper for tool approval +- `_handle_approve_mma_step(user_data)`: UI wrapper for MMA step approval +- `_handle_approve_spawn(user_data)`: UI wrapper for spawn approval +- `_handle_generate_send(user_data)`: Handle Generate + Send button click +- `_handle_reset_session(user_data)`: Handle Reset Session button click +- `_test_callback_func_write_to_file(data)`: Dummy test callback +- `_load_active_project()`: Load active project from config +- `_prune_old_logs()`: Async prune old logs on startup +- `_init_ai_and_hooks()`: Wire AI callbacks to GUI handlers +- `_init_actions()`: Build action map for _process_pending_gui_tasks +- `_process_pending_gui_tasks()`: + - Drain task lists under locks + - Execute actions (set_value, click, etc.)
+ + - Handle approval dialogs + - Handle AI responses + - Handle MMA state updates + - Handle pending history adds +- `_render_text_viewer(label, content)`: Render text viewer panel +- `_render_heavy_text(label, content)`: Render large text with clamping +- `_show_menus()`: Render menu bar +- `_gui_func()`: Main ImGui render function +- `_render_projects_panel()`: Render projects hub +- `_render_context_panel()`: Render context hub +- `_render_ai_settings_panel()`: Render AI settings hub +- `_render_discussion_panel()`: Render discussion hub +- `_render_operations_panel()`: Render operations hub +- `_render_provider_panel()`: Render provider selection +- `_render_token_budget_panel()`: Render token budget panel +- `_render_message_panel()`: Render message input +- `_render_response_panel()`: Render AI response +- `_render_tool_calls_panel()`: Render tool call history +- `_render_comms_history_panel()`: Render comms log viewer +- `_render_mma_dashboard()`: Render MMA dashboard + - Track browser + - Ticket DAG visualization + - Tier stream panels + - Ticket list with actions +- `_render_tier_stream_panel(tier_key, stream_key | None)`: Render tier stream panels +- `_render_ticket_dag_node(...)`: Render DAG node +- `_render_system_prompts_panel()`: Render system prompt editor +- `_render_theme_panel()`: Render theme selection +- `_load_fonts()`: Load custom fonts +- `_post_init()`: Post-initialization setup +- `run()`: Initialize ImGui runner and start main loop + +**Properties:** +- `current_provider` (get/set): Current AI provider +- `current_model` (get/set): Current model name + +**GUI State Variables:** +- `_project`, `_projects`: Project data +- `_files_base_dir`, `_screenshots_base_dir`: File paths +- `_ai_response`, `_ai_status`: AI response text and status +- `_comms_log`: Comms log entries +- `_tool_log`: Tool call log entries +- `_provider_options`, `_model_options`: Available options +- 
`_mma_status`, `_active_tier`, `_mma_streams`: MMA state +- `_active_track`, `_active_tickets`: Track and ticket data +- `_pending_gui_tasks`, `_pending_comms`, `_pending_tool_calls`, `_pending_history_adds`: Task queues +- `_pending_dialog`: Current approval dialog +- `_pending_mma_approval`: MMA step approval +- `_pending_mma_spawn`: MMA spawn approval +- `_pending_ask_dialog`: Ask tool dialog +- `_show_track_proposal`: Track proposal modal flag +- `_proposed_tracks`: List of proposed tracks + +**Threading Primitives:** +- `_loop`: asyncio event loop +- `_loop_thread`: Daemon thread for event loop +- `_pending_gui_tasks_lock`: Lock for GUI task list +- `_pending_comms_lock`: Lock for comms list +- `_pending_tool_calls_lock`: Lock for tool calls list +- `_pending_history_adds_lock`: Lock for history adds +- `_pending_dialog_lock`: Lock for dialog state +- `_send_thread_lock`: Lock for send_thread creation +- `_pending_actions`: Dict for pending actions + +**Usage Pattern:** Main GUI orchestrator with ImGui, event handling, MMA integration. + +### Theming (2 modules, ~27KB) + +#### 23. theme.py (15KB) +**Purpose:** Dear PyGui theming (legacy) + +**Palettes:** +- `_PALETTES`: Dict of palette name to color dict + - "DPG Default", "10x Dark", "Nord Dark", "Monokai" + - Each palette maps semantic names to RGB tuples + - WindowBg, ChildBg, PopupBg, Border, FrameBg, etc.
+ +**Color Mapping:** +- `_COL_MAP`: Maps semantic names to DPG mvThemeCol_* constants + +**State:** +- `_current_theme_tag`: Current theme tag +- `_current_font_tag`: Current font tag +- `_font_registry_tag`: Font registry tag +- `_current_palette`: Current palette name +- `_current_font_path`: Current font file path +- `_current_font_size`: Current font size +- `_current_scale`: Current scale factor + +**Functions:** +- `get_palette_names()`: Return palette names +- `get_current_palette()`: Return current palette +- `get_current_font_path()`: Return font path +- `get_current_font_size()`: Return font size +- `get_palette_colours(name)`: Return color dict for palette +- `apply(palette_name, overrides)`: Apply theme (with optional overrides) +- `apply_font(font_path, size)`: Load TTF font +- `set_scale(factor)`: Set UI scale +- `save_to_config(config)`: Save theme to config +- `load_from_config(config)`: Load theme from config + +**Usage Pattern:** Deprecated Dear PyGui theming system. + +#### 24. 
theme_2.py (12.4KB) +**Purpose:** ImGui-bundle theming (current) + +**Palettes:** +- `_PALETTES`: Dict of palette name to color dict + - "ImGui Dark", "10x Dark", "Nord Dark", "Monokai" + - Each palette maps imgui.Col_ enum values to RGBA tuples + +**State:** +- `_current_palette`: Current palette name +- `_current_font_path`: Current font file path +- `_current_font_size`: Current font size +- `_current_scale`: Current scale factor +- `_custom_font`: Loaded font object + +**Functions:** +- `get_palette_names()`: Return palette names +- `get_current_palette()`: Return current palette +- `get_current_font_path()`: Return font path +- `get_current_font_size()`: Return font size +- `get_current_scale()`: Return scale +- `_c(r, g, b, a)`: Helper to convert RGB to RGBA [0-1] +- `apply(palette_name)`: Apply palette colors to imgui style +- `set_scale(factor)`: Set font scale +- `save_to_config(config)`: Save theme to config +- `load_from_config(config)`: Load theme and scale from config +- `apply_current()`: Apply loaded palette and scale +- `get_font_loading_params()`: Return (font_path, size) for font loading + +**Usage Pattern:** Current ImGui-bundle theming system. +### Utilities (1 module, ~1KB) + +#### 25. cost_tracker.py (1.2KB) +**Purpose:** Token cost estimation + +**Constants:** +- `MODEL_PRICING`: List of (pattern, pricing_dict) tuples + - gemini-2.5-flash-lite: $0.075/$0.30 per 1M tokens + - gemini-2.5-flash: $0.15/$0.60 per 1M tokens + - gemini-3-flash-preview: $0.15/$0.60 per 1M tokens + - gemini-3.1-pro-preview: $3.50/$10.50 per 1M tokens + - claude-sonnet: $3.00/$15.00 per 1M tokens + - claude-opus: $15.00/$75.00 per 1M tokens + - deepseek-v3: $0.27/$1.10 per 1M tokens + +**Functions:** +- `estimate_cost(model, input_tokens, output_tokens)`: Calculate total cost in USD + +**Usage Pattern:** Estimate token costs for budget tracking. 
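The pricing lookup above can be sketched as a first-match scan over `(pattern, pricing)` pairs. This is a hedged sketch, not the real implementation: the substring matching rule and the `input`/`output` dict keys are assumptions; only the per-1M-token rates are taken from the list above (abridged here).

```python
# Illustrative sketch of cost_tracker.estimate_cost. The (pattern, pricing)
# shape mirrors MODEL_PRICING above; matching by substring is an assumption.
MODEL_PRICING = [
    # More specific patterns must come first so "-lite" wins over the base model.
    ("gemini-2.5-flash-lite", {"input": 0.075, "output": 0.30}),
    ("gemini-2.5-flash",      {"input": 0.15,  "output": 0.60}),
    ("claude-sonnet",         {"input": 3.00,  "output": 15.00}),
]

def estimate_cost(model, input_tokens, output_tokens):
    """Return estimated USD cost for a call, or 0.0 for unknown models."""
    for pattern, pricing in MODEL_PRICING:  # first match wins
        if pattern in model:
            return (input_tokens / 1_000_000 * pricing["input"]
                    + output_tokens / 1_000_000 * pricing["output"])
    return 0.0
```

With these rates, one million input plus one million output tokens on a flash-lite model costs $0.075 + $0.30 = $0.375; unknown model names fall through to zero rather than raising.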
+## Part 8: Deep Testing & Simulation Architecture Analysis + +### Complete File Inventory (100+ test files, ~50KB total) + +#### Fixtures & Infrastructure (4 files) + +##### conftest.py (10.7KB) +**Purpose:** Central test configuration and shared fixtures + +**Classes:** +- `VerificationLogger`: Structured diagnostic logging + - `__init__(test_name, script_name)`: Initialize with test name and logs dir + - `log_state(field, before, after, delta)`: Log state change + - `finalize(title, status, result_msg)`: Finalize test log + +**Functions:** +- `kill_process_tree(pid)`: Robustly kill process and all children + - Windows: `taskkill /F /T /PID ` + - Unix: `os.killpg(os.getpgid(pid), SIGKILL)` + +**Fixtures:** +- `vlogger(request)`: Provide VerificationLogger instance +- `mock_app()`: Mock version of App for simple unit tests +- `app_instance()`: Centralized App instance with all external side effects mocked +- `live_gui(scope="session")`: + - Spawn sloppy.py with --enable-test-hooks + - Use CREATE_NEW_PROCESS_GROUP on Windows + - Redirect stdout/stderr to logs/gui_2_py_test.log + - Poll GET /status up to 15 seconds + - Check process.poll() each iteration to detect early crashes + - If hook server never responds: kill process, pytest.fail() + - In finally: reset_session(), kill_process_tree(), close log_file + - Yield (process, gui_script) + +**Issues:** +- No cleanup verification in reset_ai_client fixture (no teardown) +- Timeout-based readiness check may miss transient initialization states + +##### mock_alias_tool.py (1.1KB) +**Purpose:** Simulates tool calling with alias resolution + +**Behavior:** +- Reads prompt from stdin +- If `'"role": "tool"'` in prompt: Returns mock success +- Else: Calls bridge script (cli_tool_bridge.py) for alias resolution +- Outputs two JSON-L lines: message + result + +##### mock_context_bleed.py (382 bytes) +**Purpose:** Mock agent for bleed testing + +**Behavior:** +- Prints init event with session_id +- Prints user message +- 
Prints assistant response +- Prints result with stats + +##### mock_gemini_cli.py (7.4KB) +**Purpose:** Fake Gemini CLI executable for integration tests + +**Main Function:** +- `main()`: + - Check sys.argv for management commands ("mcp", "extensions", "skills", "hooks") + - If management command: return silently + - Read prompt from stdin + - Route to response based on keyword matching + +**Response Routing:** +- `'"role": "tool"'` OR `"tool_call_id"` in prompt: + - Return "Tool worked!" success message +- `'"PATH: Epic Initialization"'` in prompt: + - Return two mock Track objects with IDs "mock-track-1", "mock-track-2" + - Session ID: "mock-session-epic" +- `'"PATH: Sprint Planning"'` in prompt: + - Return two mock Ticket objects + - "mock-ticket-1": independent + - "mock-ticket-2": depends on "mock-ticket-1" + - Session ID: "mock-session-sprint" +- Default (Tier 3 worker prompts): + - Return "SUCCESS: Mock Tier 3 worker implemented change. [MOCK OUTPUT]" + - Session ID: "mock-session-default" + +**Output Format:** +- Every response is exactly two JSON-L lines +- Line 1: `{"type": "message", "role": "assistant", "content": "..."}` +- Line 2: `{"type": "result", "status": "success", "stats": {"total_tokens": N}, "session_id": "mock-session-*"}` + +**Debug Output:** +- All debug info goes to stderr, keeping stdout clean for JSON-L protocol + +**Critical Flaws:** +1. **No input validation**: Never checks if prompt contains garbage +2. **No logic verification**: Never verifies Tier 2 actually generated valid tickets, only that it called prompt +3. **No failure modes**: Real LLM errors (malformed JSON, timeouts, rate limits) never tested +4. **No tool flow verification**: Doesn't verify tools were called correctly +5. 
**Deterministic responses**: Can't test conversation context, multi-turn flows, or error recovery + +#### Architecture Boundary Tests (3 files) + +##### test_arch_boundary_phase1.py (3.8KB) +**Purpose:** Tests architecture boundary hardening — Phase 1 + +**Test Class:** `TestArchBoundaryPhase1(unittest.TestCase)` + +**Tests:** +- `test_unfettered_modules_constant_removed()`: Check "UNFETTERED_MODULES" string absent from execute_agent source +- `test_full_module_context_never_injected()`: Verify "FULL MODULE CONTEXT" not in captured input for mcp_client +- `test_skeleton_used_for_mcp_client()`: Verify "DEPENDENCY SKELETON" is used for mcp_client +- `test_mma_exec_no_hardcoded_path()`: mma_exec.execute_agent must not contain hardcoded machine paths +- `test_claude_mma_exec_no_hardcoded_path()`: claude_mma_exec.execute_agent must not contain hardcoded machine paths + +**Coverage:** Code structure verification only + +##### test_arch_boundary_phase2.py (6.7KB) +**Purpose:** Tests architecture boundary hardening — Phase 2 + +**Constants:** +- `MUTATING_TOOLS`: {"set_file_slice", "py_update_definition", "py_set_signature", "py_set_var_declaration"} +- `ALL_DISPATCH_TOOLS`: All 26 tool names + +**Functions:** +- `test_toml_exposes_all_dispatch_tools()`: manual_slop.toml [agent.tools] must list every tool +- `test_toml_mutating_tools_disabled_by_default()`: Mutating tools must default to false in manual_slop.toml +- `test_default_project_exposes_all_dispatch_tools()`: default_project() agent.tools must list every tool +- `test_default_project_mutating_tools_disabled()`: Mutating tools must default to False in default_project() +- `test_gui_agent_tool_names_exposes_all_dispatch_tools()`: AGENT_TOOL_NAMES in gui_2.py must include every tool +- `test_mcp_client_has_mutating_tools_constant()`: mcp_client must expose MUTATING_TOOLS frozenset +- `test_mutating_tools_contains_write_tools()`: MUTATING_TOOLS must include all four write tools +- 
`test_mutating_tools_excludes_read_tools()`: MUTATING_TOOLS must not include read-only tools +- `test_mutating_tool_triggers_pre_tool_callback(monkeypatch)`: When mutating tool is called and pre_tool_callback is set, it must be invoked +- `test_mutating_tool_skips_callback_when_rejected(monkeypatch)`: When pre_tool_callback returns None (rejected), dispatch must NOT be called +- `test_non_mutating_tool_skips_callback()`: Read-only tools must NOT trigger pre_tool_callback + +**Coverage:** Tool config exposure and HITL enforcement verification + +##### test_arch_boundary_phase3.py (2.9KB) +**Purpose:** Tests architecture boundary hardening — Phase 3 (not in current set) + +**Tests:** +- `test_cascade_blocks_simple()`: Blocked dependency blocks immediate dependent +- `test_cascade_blocks_multi_hop()`: Blocking cascades through multiple levels +- `test_cascade_blocks_no_cascade_to_completed()`: Completed tasks not changed even if dependency blocked +- `test_cascade_blocks_partial_dependencies()`: Partial dependencies blocked → 
dependent blocked +- `test_cascade_blocks_already_in_progress()`: In-progress tasks not blocked automatically + +**Coverage:** DAG blocking cascade verification + +#### API Hooks & Integration (8 files) + +##### test_api_hook_client.py (3.5KB) +**Purpose:** Test ApiHookClient methods against live GUI + +**Tests:** +- `test_get_status_success(live_gui)`: Check get_status retrieves server health +- `test_get_project_success(live_gui)`: Check project data retrieval +- `test_get_session_success(live_gui)`: Check session data retrieval +- `test_post_gui_success(live_gui)`: Check GUI data posting +- `test_get_performance_success(live_gui)`: Check performance metrics retrieval +- `test_unsupported_method_error()`: Unsupported HTTP method raises ValueError +- `test_get_text_value()`: Test get_text_value wrapper +- `test_get_node_status()`: Test DAG node status retrieval + +**Coverage:** Hook API client verification + +##### test_api_hook_extensions.py (1.9KB) +**Purpose:** Test API hook extensions for UI interaction + +**Tests:** +- `test_api_client_has_extensions()`: Verify ApiHookClient has extension methods +- `test_select_tab_integration(live_gui)`: Test tab selection via hooks +- `test_select_list_item_integration(live_gui)`: Test list item selection +- `test_get_indicator_state_integration(live_gui)`: Test indicator state retrieval +- `test_app_processes_new_actions()`: Test new action processing + +**Coverage:** UI interaction via Hook API + +##### test_conductor_api_hook_integration.py (2.7KB) +**Purpose:** Test Conductor integration via hooks + +**Functions:** +- `simulate_conductor_phase_completion(client)`: Simulates Conductor phase completion using ApiHookClient +- `test_conductor_integrates_api_hook_client_for_verification(live_gui)`: Verify Conductor uses ApiHookClient for verification +- `test_conductor_handles_api_hook_failure(live_gui)`: Verify Conductor handles API hook verification failure +- `test_conductor_handles_api_hook_connection_error()`: Verify 
Conductor handles connection error + +**Coverage:** Conductor-Hook API integration verification + +##### test_headless_service.py (7KB) +**Purpose:** Test headless API service + +**Test Class:** `TestHeadlessAPI(unittest.TestCase)` + +**Tests:** +- `test_health_endpoint()`: Check /status endpoint +- `test_status_endpoint_unauthorized()`: Check /status without API key +- `test_status_endpoint_authorized()`: Check /status with valid API key +- `test_generate_endpoint()`: Check /generate endpoint +- `test_pending_actions_endpoint()`: Check /pending_actions endpoint +- `test_confirm_action_endpoint()`: Check /confirm endpoint +- `test_list_sessions_endpoint()`: Check /sessions endpoint +- `test_get_context_endpoint()`: Check /context endpoint +- `test_endpoint_no_api_key_configured()`: Check behavior without API key + +**Test Class:** `TestHeadlessStartup(unittest.TestCase)` + +**Tests:** +- `test_headless_flag_prevents_gui_run()`: --headless flag prevents GUI, runs FastAPI +- `test_normal_startup_calls_gui_run()`: Normal startup calls GUI + +**Functions:** +- `test_fastapi_installed()`: Verify FastAPI installed +- `test_uvicorn_installed()`: Verify Uvicorn installed + +**Coverage:** FastAPI endpoints and startup verification + +##### test_headless_verification.py (7KB) +**Purpose:** Test headless verification without GUI + +**Tests:** +- `test_headless_verification_full_run(vlogger)`: + - Initialize ConductorEngine with Track + - Simulate full execution run + - Mock ai_client.send for successful tool calls and final responses + - Verify Context Amnesia is maintained +- `test_headless_verification_error_and_qa_interceptor(vlogger)`: + - Simulate shell error + - Verify Tier 4 QA interceptor is triggered + - Verify summary is injected into worker history for next retry + +**Coverage:** Full conductor run verification +#### GUI Testing (26 files) + +##### test_gui2_events.py (1.7KB) +**Purpose:** Test GUI event subscriptions + +**Fixture:** +- `app_instance()`: Create App 
instance with mocked render functions + +**Tests:** +- `test_app_subscribes_to_events(app_instance)`: Verify App.__init__ subscribes necessary event handlers +- `test_handle_ai_response_resets_stream(app_instance)`: Verify handle_ai_response replaces/finalizes stream +- `test_user_request_event_payload()`: Verify UserRequestEvent payload structure +- `test_async_event_queue()`: Verify AsyncEventQueue put/get + +**Coverage:** Event system integration + +##### test_gui2_layout.py (1KB) +**Purpose:** Test GUI hub layout + +**Tests:** +- `test_gui2_hubs_exist_in_show_windows(app_instance)`: Verify new Hub windows in show_windows +- `test_gui2_old_windows_removed_from_show_windows(app_instance)`: Verify old windows removed + +**Coverage:** Layout structure verification + +##### test_gui2_mcp.py (2KB) +**Purpose:** Test MCP tool dispatch + +**Tests:** +- `test_mcp_tool_call_is_dispatched(app_instance)`: Verify tool calls dispatched to mcp_client + +**Coverage:** MCP integration verification + +##### test_gui2_parity.py (3KB) +**Purpose:** Test GUI hooks for value setting and clicking + +**Tests:** +- `test_gui2_set_value_hook_works(live_gui)`: Test set_value GUI hook +- `test_gui2_click_hook_works(live_gui)`: Test click hook for Reset button +- `test_gui2_custom_callback_hook_works(live_gui)`: Test custom_callback hook + +**Coverage:** Hook parity verification + +##### test_gui2_performance.py (2.3KB) +**Purpose:** Test performance benchmarking + +**Tests:** +- `test_performance_benchmarking(live_gui)`: Collects performance metrics for current GUI script +- `test_performance_baseline_check()`: Verifies performance metrics exist + +**Coverage:** Performance tracking verification + +##### test_gui_async_events.py (2.5KB) +**Purpose:** Test async event routing + +**Tests:** +- `test_handle_generate_send_pushes_event(mock_gui)`: Verify handle_generate_send pushes UserRequestEvent +- `test_user_request_event_payload()`: Verify UserRequestEvent structure +- 
`test_async_event_queue()`: Verify AsyncEventQueue operation + +**Coverage:** Async event system verification + +##### test_gui_diagnostics.py (861 bytes) +**Purpose:** Test diagnostics panel + +**Tests:** +- `test_diagnostics_panel_initialization(app_instance)`: Verify diagnostics panel initializes +- `test_diagnostics_history_updates(app_instance)`: Verify performance history updates correctly + +**Coverage:** Diagnostics verification + +##### test_gui_events.py (803 bytes) +**Purpose:** Test GUI event handling + +**Tests:** +- `test_gui_updates_on_event(app_instance)`: Verify GUI updates on events + +**Coverage:** Event-driven UI updates + +##### test_gui_phase3.py (3.1KB) +**Purpose:** Test GUI phase 3 features + +**Tests:** +- `test_track_proposal_editing(app_instance)`: Track proposal editing +- `test_conductor_setup_scan(app_instance, tmp_path)`: Conductor directory scan +- `test_create_track(app_instance, tmp_path)`: Track creation + +**Coverage:** Track proposal workflow + +##### test_gui_phase4.py (7.4KB) +**Purpose:** Test GUI phase 4 features + +**Tests:** +- `test_add_ticket_logic(mock_app)`: Add ticket logic +- `test_remove_ticket_logic(mock_app)`: Remove ticket logic +- `test_toggle_ticket_logic(mock_app)`: Toggle ticket logic + +**Coverage:** Ticket management workflow + +##### test_gui_streaming.py (4.1KB) +**Purpose:** Test MMA streaming + +**Tests:** +- `test_mma_stream_event_routing(app_instance)`: Verify "mma_stream" events reach mma_streams +- `test_mma_stream_multiple_workers(app_instance)`: Verify streaming works for multiple workers +- `test_handle_ai_response_resets_stream(app_instance)`: Verify final response replaces stream +- `test_handle_ai_response_streaming(app_instance)`: Verify streaming appends to mma_streams + +**Coverage:** Stream routing verification + +##### test_gui_stress_performance.py (1.8KB) +**Purpose:** Test stress performance + +**Tests:** +- `test_comms_volume_stress_performance(live_gui)`: Inject many session entries 
and verify performance + +**Coverage:** Stress testing + +##### test_gui_updates.py (1.7KB) +**Purpose:** Test GUI update mechanisms + +**Tests:** +- `test_telemetry_data_updates_correctly(app_instance)`: Verify telemetry updates +- `test_performance_history_updates(app_instance)`: Verify performance history +- `test_gui_updates_on_event(app_instance)`: Verify GUI updates on events + +**Coverage:** Update mechanism verification + +##### test_layout_reorganization.py (2.4KB) +**Purpose:** Test new consolidated Hub layout + +**Tests:** +- `test_new_hubs_defined_in_show_windows(mock_app)`: Verify Hub windows defined +- `test_old_windows_removed_from_gui2(app_instance)`: Verify old windows removed + +**Coverage:** Layout consolidation verification + +##### test_live_gui_integration.py (3.1KB) +**Purpose:** Test user request flow integration + +**Tests:** +- `test_user_request_integration_flow(mock_app)`: Verify UserRequestEvent triggers AI and updates UI + +**Coverage:** Integration flow verification + +##### test_live_workflow.py (3KB) +**Purpose:** Test full GUI workflow + +**Tests:** +- `test_full_live_workflow(live_gui)`: Integration test driving GUI through full workflow + +**Coverage:** End-to-end workflow verification + +#### MMA & Conductor (11 files) + +##### test_conductor_engine.py (14.6KB) +**Purpose:** Test ConductorEngine implementation + +**Tests:** +- `test_conductor_engine_initialization()`: Test ConductorEngine initialization +- `test_conductor_engine_run_executes_tickets_in_order(monkeypatch, vlogger)`: Test run() executes tickets in order +- `test_run_worker_lifecycle_calls_ai_client_send(monkeypatch)`: Test worker lifecycle triggers AI client send +- `test_run_worker_lifecycle_context_injection(monkeypatch)`: Test worker lifecycle injects AST views +- `test_run_worker_lifecycle_blocked(mock_ai_client)`: Test worker marks ticket as blocked +- `test_run_worker_lifecycle_step_mode_confirmation(monkeypatch)`: Test step mode confirmation +- 
`test_conductor_engine_dynamic_parsing_and_execution(monkeypatch, vlogger)`: Test dynamic parsing and execution +- `test_run_worker_lifecycle_streams_response_via_queue(monkeypatch)`: Test streaming response via queue +- `test_run_worker_lifecycle_token_usage_from_comms_log(monkeypatch)`: Test token usage from comms log + +**Coverage:** Conductor engine behavior verification + +##### test_conductor_tech_lead.py (3.4KB) +**Purpose:** Test Conductor Tech Lead + +**Test Class:** `TestConductorTechLead(unittest.TestCase)` + +**Tests:** +- `test_generate_tickets_parse_error()`: Test JSON parsing error handling +- `test_generate_tickets_success()`: Test successful ticket generation +- `test_topological_sort_linear()`: Test linear topological sort +- `test_topological_sort_complex()`: Test complex topological sort +- `test_topological_sort_cycle()`: Test cycle detection +- `test_topological_sort_empty()`: Test empty list handling +- `test_topological_sort_missing_dependency()`: Test missing dependency handling + +**Coverage:** Ticket generation and dependency resolution verification + +##### test_dag_engine.py (3.7KB) +**Purpose:** Test TrackDAG and ExecutionEngine + +**Tests:** +- `test_get_ready_tasks_linear()`: Test ready tasks for linear DAG +- `test_get_ready_tasks_branching()`: Test ready tasks for branching DAG +- `test_has_cycle_no_cycle()`: Test cycle detection with no cycles +- `test_has_cycle_direct_cycle()`: Test direct cycle detection +- `test_has_cycle_indirect_cycle()`: Test indirect cycle detection +- `test_has_cycle_complex_no_cycle()`: Test complex DAG with no cycles +- `test_get_ready_tasks_multiple_deps()`: Test multiple dependencies +- `test_topological_sort()`: Test topological sort +- `test_topological_sort_cycle()`: Test topological sort with cycles + +**Coverage:** DAG algorithm verification + +##### test_execution_engine.py (4.1KB) +**Purpose:** Test ExecutionEngine + +**Tests:** +- `test_execution_engine_basic_flow()`: Test basic flow +- 
`test_execution_engine_update_nonexistent_task()`: Test updating non-existent task +- `test_execution_engine_status_persistence()`: Test status persistence +- `test_execution_engine_auto_queue()`: Test auto_queue behavior +- `test_execution_engine_step_mode()`: Test step_mode behavior +- `test_execution_engine_approve_task()`: Test approve_task behavior + +**Coverage:** Execution engine state machine verification + +##### test_mma_models.py (5.9KB) +**Purpose:** Test MMA data models + +**Tests:** +- `test_ticket_instantiation()`: Test Ticket instantiation with required fields +- `test_ticket_with_dependencies()`: Test Ticket with dependencies +- `test_track_instantiation()`: Test Track instantiation +- `test_track_can_handle_empty_tickets()`: Test Track with empty tickets +- `test_worker_context_instantiation()`: Test WorkerContext instantiation +- `test_ticket_mark_blocked()`: Test ticket.mark_blocked() +- `test_ticket_mark_complete()`: Test ticket.mark_complete() +- `test_track_get_executable_tickets()`: Test track.get_executable_tickets() +- `test_track_get_executable_tickets_complex()`: Test get_executable_tickets with complex dependencies + +**Coverage:** Model invariant verification + +##### test_mma_orchestration_gui.py (4.6KB) +**Purpose:** Test MMA GUI state and orchestration + +**Tests:** +- `test_mma_ui_state_initialization(app_instance)`: Verify UI state initialization +- `test_process_pending_gui_tasks_show_track_proposal(app_instance)`: Verify show_track_proposal action +- `test_cb_plan_epic_launches_thread(app_instance)`: Verify plan epic launches thread +- `test_process_pending_gui_tasks_mma_spawn_approval(app_instance)`: Verify spawn approval action +- `test_handle_ai_response_with_stream_id(app_instance)`: Verify routing to mma_streams +- `test_handle_ai_response_fallback(app_instance)`: Verify fallback to ai_response + +**Coverage:** MMA GUI state management + +##### test_mma_prompts.py (1.9KB) +**Purpose:** Test MMA system prompts + +**Tests:** 
+- `test_tier1_epic_init_constraints()`: Verify Tier 1 epic init prompt constraints +- `test_tier1_track_delegation_constraints()`: Verify Tier 1 track delegation prompt constraints +- `test_tier1_macro_merge_constraints()`: Verify Tier 1 macro-merge prompt constraints +- `test_tier2_sprint_planning_constraints()`: Verify Tier 2 sprint planning prompt constraints +- `test_tier2_code_review_constraints()`: Verify Tier 2 code review prompt constraints +- `test_tier2_track_finalization_constraints()`: Verify Tier 2 track finalization prompt constraints +- `test_tier2_contract_first_constraints()`: Verify Tier 2 contract-first constraints + +**Coverage:** System prompt constraint verification + +##### test_mma_ticket_actions.py (1.1KB) +**Purpose:** Test MMA ticket actions + +**Tests:** +- `test_cb_ticket_retry(app_instance)`: Test ticket retry callback +- `test_cb_ticket_skip(app_instance)`: Test ticket skip callback + +**Coverage:** Ticket action verification + +##### test_orchestration_logic.py (5KB) +**Purpose:** Test orchestration logic + +**Tests:** +- `test_generate_tracks(mock_ai_client)`: Test Tier 1 track generation +- `test_generate_tickets(mock_ai_client)`: Test Tier 2 ticket generation +- `test_topological_sort()`: Test topological sort +- `test_track_executable_tickets()`: Test track.get_executable_tickets() +- `test_conductor_engine_run(vlogger)`: Test ConductorEngine.run() +- `test_parse_json_tickets()`: Test ticket JSON parsing +- `test_run_worker_lifecycle_blocked(mock_ai_client)`: Test worker blocked handling + +**Coverage:** Orchestration logic verification + +##### test_orchestrator_pm.py (3KB) +**Purpose:** Test Orchestrator PM + +**Test Class:** `TestOrchestratorPM(unittest.TestCase)` + +**Tests:** +- `test_generate_tracks_success(mock_send, mock_summarize)`: Test successful track generation +- `test_generate_tracks_markdown_wrapped(mock_send, mock_summarize)`: Test markdown wrapping +- `test_generate_tracks_malformed_json(mock_send, 
mock_summarize)`: Test malformed JSON handling
+
+**Coverage:** Orchestrator PM verification
+
+##### test_orchestrator_pm_history.py (2.8KB)
+**Purpose:** Test Orchestrator PM history
+
+**Test Class:** `TestOrchestratorPMHistory(unittest.TestCase)`
+
+**Tests:**
+- `test_get_track_history_summary()`: Test history summary generation
+- `test_get_track_history_summary_missing_files()`: Test missing file handling
+- `test_generate_tracks_with_history(mock_send, mock_summarize, mock_registry)`: Test track generation with history
+
+**Coverage:** History summary verification
+
+#### MCP & Tools (5 files)
+
+##### test_agent_capabilities.py (1.3KB)
+**Purpose:** Test agent model listing
+
+**Tests:**
+- `test_agent_capabilities_listing()`: Test model listing
+
+**Coverage:** Capability verification
+
+##### test_agent_tools_wiring.py (865 bytes)
+**Purpose:** Test agent tools wiring
+
+**Tests:**
+- `test_set_agent_tools()`: Test set_agent_tools() function
+- `test_build_anthropic_tools_conversion()`: Test Anthropic tools conversion
+
+**Coverage:** Tool setup verification
+
+##### test_cli_tool_bridge.py (2.5KB)
+**Purpose:** Test CLI tool bridge
+
+**Test Class:** `TestCliToolBridge(unittest.TestCase)`
+
+**Tests:**
+- `test_allow_decision(mock_request, mock_stdout, mock_stdin, mock_hook)`: Test allow decision flow
+- `test_deny_decision(mock_request, mock_stdout, mock_stdin, mock_hook)`: Test deny decision flow
+- `test_unreachable_hook_server(mock_request, mock_stdout, mock_stdin, mock_hook)`: Test unreachable server handling
+
+**Coverage:** Bridge decision logic verification
+
+##### test_cli_tool_bridge_mapping.py (1.8KB)
+**Purpose:** Test CLI tool bridge mapping
+
+**Test Class:** `TestCliToolBridgeMapping(unittest.TestCase)`
+
+**Tests:**
+- `test_mapping_from_api_format(mock_request, mock_stdout, mock_stdin, mock_hook)`: Verify mapping from API format
+
+**Coverage:** Format mapping verification
+
+##### test_mcp_perf_tool.py (607 bytes)
+**Purpose:** Test MCP
performance tool + +**Tests:** +- `test_mcp_perf_tool_retrieval()`: Test get_ui_performance retrieval + +**Coverage:** Performance tool verification +#### Simulations (8 files) + +##### test_extended_sims.py (2.7KB) +**Purpose:** Extended simulations against live GUI + +**Tests:** +- `test_context_sim_live(live_gui)`: Context simulation +- `test_ai_settings_sim_live(live_gui)`: AI settings simulation +- `test_tools_sim_live(live_gui)`: Tools simulation +- `test_execution_sim_live(live_gui)`: Execution simulation + +**Coverage:** Multi-simulation integration + +##### test_sim_ai_settings.py (1.4KB) +**Purpose:** Test AI settings simulation + +**Tests:** +- `test_ai_settings_simulation_run()`: Test simulation runs correctly + +**Coverage:** AI settings simulation verification + +##### test_sim_base.py (1.2KB) +**Purpose:** Test base simulation class + +**Tests:** +- `test_base_simulation_init()`: Test base simulation initialization +- `test_base_simulation_setup()`: Test base simulation setup + +**Coverage:** Base simulation verification + +##### test_sim_context.py (1.4KB) +**Purpose:** Test context simulation + +**Tests:** +- `test_context_simulation_run()`: Test context simulation runs correctly + +**Coverage:** Context simulation verification + +##### test_sim_execution.py (1.5KB) +**Purpose:** Test execution simulation + +**Tests:** +- `test_execution_simulation_run()`: Test execution simulation runs correctly + +**Coverage:** Execution simulation verification + +##### test_sim_tools.py (1.2KB) +**Purpose:** Test tools simulation + +**Tests:** +- `test_tools_simulation_run()`: Test tools simulation runs correctly + +**Coverage:** Tools simulation verification + +##### test_user_agent.py (633 bytes) +**Purpose:** Test user agent + +**Tests:** +- `test_user_agent_instantiation()`: Test UserSimAgent instantiation +- `test_perform_action_with_delay()`: Test action with delay + +**Coverage:** User agent behavior verification + +##### test_workflow_sim.py (1.5KB) 
+**Purpose:** Test workflow simulator + +**Tests:** +- `test_simulator_instantiation()`: Test simulator instantiation +- `test_setup_new_project()`: Test project setup +- `test_discussion_switching()`: Test discussion switching +- `test_history_truncation()`: Test history truncation + +**Coverage:** Workflow simulation verification + +##### ping_pong.py (1.6KB) +**Purpose:** Simple ping/pong test + +**Main Function:** +- `main()`: Basic agent interaction verification + +**Coverage:** Minimal simulation test + +##### live_walkthrough.py (2.6KB) +**Purpose:** Full walkthrough script + +**Main Function:** +- `main()`: Orchestrates complete GUI workflow via hooks + +**Coverage:** End-to-end workflow verification +#### Approval & HITL (2 files) + +##### test_spawn_interception.py (3.4KB) +**Purpose:** Test spawn approval interception + +**Tests:** +- `test_confirm_spawn_pushed_to_queue()`: Test confirm_spawn pushes to queue +- `test_rejection_handling(mock_confirm, mock_ai_client, app_instance)`: Test rejection handling +- `test_run_worker_lifecycle_rejected(mock_confirm, mock_ai_client, app_instance)`: Test worker lifecycle on rejection + +**Coverage:** Spawn approval flow verification + +##### test_tier4_interceptor.py (8.1KB) +**Purpose:** Test Tier 4 QA interceptor + +**Tests:** +- `test_run_powershell_qa_callback_on_failure(vlogger)`: Test QA callback on shell failure +- `test_run_powershell_qa_callback_on_stderr_only(vlogger)`: Test QA callback on stderr +- `test_run_powershell_no_qa_callback_on_success()`: Test no QA callback on success +- `test_run_powershell_optional_qa_callback()`: Test optional QA callback +- `test_end_to_end_tier4_integration(vlogger)`: Test end-to-end Tier 4 integration +- `test_ai_client_passes_qa_callback()`: Test ai_client passes qa_callback +- `test_gemini_provider_passes_qa_callback_to_run_script()`: Test Gemini passes qa_callback + +**Coverage:** Tier 4 QA flow verification +#### Token Management (2 files) + +##### test_token_usage.py 
(2.4KB) +**Purpose:** Test token usage tracking + +**Tests:** +- `test_token_usage_tracking()`: Test token tracking + +**Coverage:** Token usage verification + +##### test_token_viz.py (5.3KB) +**Purpose:** Test token visualization + +**Tests:** +- `test_add_bleed_derived_aliases()`: Test bleed stats aliases +- `test_add_bleed_derived_headroom()`: Test headroom calculation +- `test_add_bleed_derived_would_trim_false()`: Test trim boundary +- `test_add_bleed_derived_would_trim_true()`: Test just below threshold +- `test_add_bleed_derived_breakdown()`: Test breakdown calculation +- `test_add_bleed_derived_history_clamped_to_zero()`: Test history clamping +- `test_add_bleed_derived_headroom_clamped_to_zero()`: Test headroom clamping +- `test_get_history_bleed_stats_returns_all_keys_unknown_provider()`: Test all keys returned +- `test_app_token_stats_initialized_empty(app_instance)`: Test token stats initialization +- `test_app_last_stable_md_initialized_empty(app_instance)`: Test last stable MD initialization +- `test_app_has_render_token_budget_panel(app_instance)`: Test token budget panel exists +- `test_render_token_budget_panel_empty_stats_no_crash(app_instance)`: Test panel rendering with empty stats +- `test_would_trim_boundary_exact()`: Test exact trim boundary +- `test_would_trim_just_below_threshold()`: Test just below threshold +- `test_would_trim_just_above_threshold()`: Test just above threshold +- `test_gemini_cache_fields_accessible()`: Test Gemini cache field access +- `test_anthropic_history_lock_accessible()`: Test Anthropic history lock access + +**Coverage:** Token budget panel verification +#### Tiered Context (1 file) + +##### test_tiered_context.py (5.1KB) +**Purpose:** Test tiered context building + +**Tests:** +- `test_build_tier1_context_exists()`: Test build_tier1_context function exists +- `test_build_tier2_context_exists()`: Test build_tier2_context function exists +- `test_build_tier3_context_ast_skeleton(monkeypatch)`: Test 
build_tier3_context uses AST skeletons
+- `test_build_tier3_context_exists()`: Test build_tier3_context function exists
+- `test_build_file_items_with_tiers(tmp_path)`: Test file items with tiers
+- `test_build_files_section_with_dicts(tmp_path)`: Test files section building
+- `test_tiered_context_by_tier_field()`: Test tiered context by tier field
+
+**Coverage:** Tiered context verification
+
+#### Logging & History
+
+##### test_history_management.py (8.7KB)
+**Purpose:** Test history management
+
+**Tests:**
+- `test_aggregate_includes_segregated_history(tmp_path)`: Test aggregate includes segregated history
+- `test_mcp_blacklist(tmp_path)`: Test MCP client blacklists files
+- `test_aggregate_blacklist(tmp_path)`: Test aggregate respects blacklisting
+- `test_migration_on_load(tmp_path)`: Test migration on load
+- `test_save_separation(tmp_path)`: Test save separation
+- `test_history_persistence_across_turns(tmp_path)`: Test persistence across turns
+- `test_get_history_bleed_stats_basic()`: Test history bleed stats
+
+**Coverage:** History persistence verification
+
+##### test_log_management_ui.py (3.2KB)
+**Purpose:** Test log management UI
+
+**Fixture:**
+- `mock_config(tmp_path)`: Mock config
+- `mock_project(tmp_path)`: Mock project
+- `app_instance(mock_config, mock_project, monkeypatch)`: App instance
+
+**Tests:**
+- `test_log_management_init(app_instance)`: Test log management initialization
+- `test_render_log_management_logic(app_instance)`: Test render logic
+
+**Coverage:** Log UI verification
+
+##### test_log_pruner.py (2.3KB)
+**Purpose:** Test log pruning
+
+**Fixture:**
+- `pruner_setup(tmp_path)`: Tuple of LogPruner, LogRegistry, Path
+
+**Tests:**
+- `test_prune_old_insignificant_logs(pruner_setup)`: Test pruning old insignificant logs
+
+**Coverage:** Log pruning verification
+
+##### test_log_registry.py (8.6KB)
+**Purpose:** Test log registry
+
+**Test Class:** `TestLogRegistry(unittest.TestCase)`
+
+**Tests:**
+-
`test_instantiation()`: Test LogRegistry instantiation
+- `test_register_session()`: Test session registration
+- `test_update_session_metadata()`: Test metadata update
+- `test_is_session_whitelisted()`: Test whitelist checking
+- `test_get_old_non_whitelisted_sessions()`: Test retrieval of old sessions
+
+**Coverage:** Registry functionality verification
+
+##### test_logging_e2e.py (3KB)
+**Purpose:** Test end-to-end logging
+
+**Fixture:**
+- `e2e_setup(tmp_path, monkeypatch)`: Setup for e2e test
+
+**Tests:**
+- `test_logging_e2e(e2e_setup)`: Test full logging e2e
+
+**Coverage:** Logging e2e verification
+
+##### test_session_logging.py (2.2KB)
+**Purpose:** Test session logging
+
+**Fixture:**
+- `temp_logs(tmp_path, monkeypatch)`: Temporary logs path
+
+**Tests:**
+- `test_open_session_creates_subdir_and_registry(temp_logs)`: Test session directory and registry creation
+
+**Coverage:** Session logging verification
+
+##### test_process_pending_gui_tasks.py (2.4KB)
+**Purpose:** Test process pending GUI tasks
+
+**Fixture:**
+- `app_instance()`: App instance
+
+**Tests:**
+- `test_redundant_calls_in_process_pending_gui_tasks(app_instance)`: Test no redundant calls
+- `test_gcli_path_updates_adapter(app_instance)`: Test gcli_path updates adapter
+
+**Coverage:** Pending tasks processing verification
+
+##### test_project_manager_tracks.py (2.7KB)
+**Purpose:** Test project manager tracks
+
+**Tests:**
+- `test_get_all_tracks_empty(tmp_path)`: Test get_all_tracks with empty directory
+- `test_get_all_tracks_with_state(tmp_path)`: Test get_all_tracks with state
+- `test_get_all_tracks_with_metadata_json(tmp_path)`: Test get_all_tracks with metadata
+- `test_get_all_tracks_malformed(tmp_path)`: Test get_all_tracks with malformed data
+
+**Coverage:** Project manager tracks verification
+
+##### test_track_state_persistence.py (3.1KB)
+**Purpose:** Test track state persistence
+
+**Tests:**
+- `test_track_state_persistence(tmp_path)`: Test save/load TrackState
+
+**Coverage:** Track state persistence verification
+
+##### test_track_state_schema.py (6.1KB)
+**Purpose:** Test track state schema
+
+**Tests:**
+- `test_track_state_instantiation()`: Test TrackState instantiation
+- `test_track_state_to_dict()`: Test to_dict() method
+- `test_track_state_from_dict()`: Test from_dict() class method
+- `test_track_state_from_dict_empty_and_missing()`: Test from_dict with empty and missing values
+- `test_track_state_to_dict_with_none()`: Test to_dict with None values
+
+**Coverage:** Track state schema verification
+
+##### test_tree_sitter_setup.py (860 bytes)
+**Purpose:** Test tree-sitter setup
+
+**Tests:**
+- `test_tree_sitter_python_setup()`: Test tree-sitter and tree-sitter-python installed and can parse
+
+**Coverage:** Tree-sitter installation verification
+
+##### test_vlogger_availability.py (341 bytes)
+**Purpose:** Test VerificationLogger availability
+
+**Tests:**
+- `test_vlogger_available(vlogger)`: Test VerificationLogger available
+
+**Coverage:** Fixture verification
+
+##### test_ai_style_formatter.py (2.8KB)
+**Purpose:** Test AI style formatter
+
+**Tests:**
+- `test_basic_indentation()`: Test 1-space indentation
+- `test_top_level_blank_lines()`: Test max one blank line between top-level definitions
+- `test_inner_blank_lines()`: Test zero blank lines within function bodies
+- `test_multiline_string_safety()`: Test multiline string safety
+- `test_continuation_indentation()`: Test continuation indentation
+- `test_multiple_top_level_definitions(vlogger)`: Test multiple top-level definitions
+
+**Coverage:** Style formatting verification
+
+## Part 8: Deep Testing & Simulation Architecture Analysis
+
+### Gap 1: No Real-Time Latency Simulation
+
+**Current Implementation:**
+```python
+# simulation/sim_base.py
+time.sleep(random.uniform(0.5, 2.0))  # Fixed delays
+```
+
+**What's Missing:**
+- Variable LLM latency (1-10s)
+- Network latency (100-500ms per request)
+- UI rendering time (16-33ms per frame)
+- Database I/O variance
+
+**Impact:**
+- Tests don't catch timeout issues under real latency
+- Tests don't verify streaming UX (chunked text display)
+- Tests don't measure actual perceived performance
+
+### Gap 2: No Human-Like Behavior Simulation
+
+**Current Implementation:**
+```python
+# simulation/user_agent.py
+class UserSimAgent:
+    def generate_response(self, conversation_history):
+        # Simple heuristic-based responses
+        ...
+
+    def perform_action_with_delay(self, action_func):
+        # Execute action with human-like delay
+        ...
+```
+
+**What's Missing:**
+- Typing speed (50-200 WPM with variability)
+- Hesitation before actions
+- Mistakes (wrong button clicks, editing errors)
+- Reading time for approval dialogs
+- Task switching (window switching, getting distracted)
+
+**Impact:**
+- Tests don't catch UI issues from rapid or erroneous user input
+- Tests don't verify users can recover from mistakes
+- Tests don't measure actual task completion time
+
+### Gap 3: Arbitrary Polling Intervals Miss Transient States
+
+**Current
Implementation:**
+```python
+# All simulation tests
+for _ in range(60):  # 1-second polls
+    if condition_met():
+        break
+    time.sleep(1)
+```
+
+**Problem:** States that exist for <1 second are never observed.
+
+**Examples of Missed States:**
+- Loading spinner (200-500ms duration)
+- Flickering indicator (50-100ms)
+- Transient error message (300ms duration)
+- Partial state transitions between A and B
+
+**Impact:**
+- UI glitches and race conditions that only manifest briefly are never caught.
+
+### Gap 4: Mock CLI Redirection - Subprocess Bypass
+
+**Current Implementation:**
+```python
+# All integration tests
+client.set_value('gcli_path', mock_cli_path)
+```
+
+**What's Not Tested:**
+- Real subprocess spawning issues (PATH problems, permission errors)
+- Environment variable passing
+- CLI argument parsing and validation
+- Stdin/stderr handling
+- Process cleanup on failure
+
+**Impact:** Real subprocess integration bugs are never caught.
+
+### Gap 5: State Verification is Shallow
+
+**Current Implementation:**
+```python
+# From test_visual_sim_mma_v2.py
+status = client.get_mma_status()
+tickets = status.get('active_tickets', [])
+assert len(tickets) >= 2  # Only checks length
+
+streams = status.get('mma_streams', {})
+if "SUCCESS: Mock Tier 3 worker" in streams[tier3_key]:
+    streams_found = True  # Only checks substring
+```
+
+**What's Not Verified:**
+- Is the ticket structure valid?
+- Are the IDs unique?
+- Are the dependencies correct?
+- Is the stream content complete?
+- Are there errors in the output?
+
+**Impact:** Data integrity issues are never caught.
+
+### Gap 6: No Stress Testing
+
+**Current Coverage:** None
+
+**Missing Tests:**
+- Load testing with many concurrent requests
+- Edge case bombardment (rapid user input, malformed data)
+- Performance under resource constraints
+
+**Impact:** Resource leaks and race conditions are never caught.
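+The coarse 1-second polling pattern above can be tightened with a deadline-based helper that samples at a much finer interval, making sub-second states observable and letting waits exit as soon as the condition holds. A minimal sketch — the `wait_for` name, its defaults, and the commented usage line are illustrative assumptions, not existing project API:
+
+```python
+import time
+
+def wait_for(condition, timeout=10.0, interval=0.05):
+    """Poll `condition` every `interval` seconds until it returns a truthy
+    value or `timeout` elapses. Returns the truthy value, or None on timeout."""
+    deadline = time.monotonic() + timeout
+    while time.monotonic() < deadline:
+        result = condition()
+        if result:
+            return result
+        time.sleep(interval)
+    return None
+
+# Hypothetical usage against the hook client described in this report:
+# tickets = wait_for(lambda: client.get_mma_status().get('active_tickets'))
+```
+
+At a 50ms interval, a 200-500ms loading spinner is sampled several times instead of being skipped entirely, and successful waits cost at most one interval of overshoot rather than a full second.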
+
+## Part 9: Summary Table: Testing Pitfalls by Severity
+
+| Severity | Issue | Files Affected |
+|----------|-------|----------------|
+| **HIGH** | Mock provider always returns success | All integration tests using mock_gemini_cli.py |
+| **HIGH** | Auto-approval of all HITL gates | test_visual_sim_mma_v2.py, all simulation tests |
+| **HIGH** | Substring-based assertions | All visual and MMA tests |
+| **HIGH** | State existence only, no validation | All MMA and conductor tests |
+| **HIGH** | No negative path testing | Entire test suite |
+| **HIGH** | No state machine validation | All MMA and conductor tests |
+| **HIGH** | No concurrent access testing | Entire test suite |
+| **MEDIUM** | No visual verification | All integration and visual tests |
+| **MEDIUM** | Arbitrary polling intervals miss transient states | All polling tests |
+| **MEDIUM** | Mock CLI bypasses subprocess | All integration tests |
+| **MEDIUM** | Timeout-based testing brittle | All polling tests |
+| **MEDIUM** | State pollution between tests | Tests using reset_ai_client fixture |
+| **MEDIUM** | No visual rendering | All visual tests |
+| **MEDIUM** | No stress testing | Entire test suite |
+| **LOW** | No real-time latency simulation | Simulation tests |
+| **LOW** | No human-like behavior | Simulation tests |
+
+**Total High Priority Issues:** 7
+
+**Total Medium Priority Issues:** 7
+
+**Total Low Priority Issues:** 2
+
+---
+
+## Part 10: Architecture Strengths & Design Patterns Observed
+
+### Strengths
+
+1. **Clear Layering:**
+   - 4-tier MMA with explicit boundaries (Tier 1 PM → Tier 2 Tech Lead → Tier 3 Worker → Tier 4 QA)
+   - Tier roles clearly defined with responsibilities
+   - State transitions managed via ConductorEngine
+   - All tiers use the same ai_client interface
+
+2. **Decoupled Events:**
+   - EventEmitter for synchronous pub/sub
+   - AsyncEventQueue for cross-thread async communication
+   - Clear separation between GUI thread and asyncio worker thread
+
+3. **HITL Enforcement:**
+   - MUTATING_TOOLS frozenset for identifying dangerous tools
+   - pre_tool_callback routing for user approval
+   - Approval dialogs (ConfirmDialog, MMAApprovalDialog, MMASpawnApprovalDialog) for all destructive actions
+
+4. **Token Management:**
+   - History truncation to prevent token bloat
+   - Tier-scoped context (build_tier[1-3]_context)
+   - Cache TTL for Gemini (3600s, rebuilt at 90%)
+   - Token usage tracking per tier
+
+5. **Multi-Provider Support:**
+   - Unified ai_client.send() interface
+   - Supports Gemini, Anthropic, DeepSeek, Gemini CLI
+   - Provider-specific optimizations (Gemini server-side caching, Anthropic prompt caching)
+
+6. **Comprehensive Testing Infrastructure:**
+   - live_gui fixture for session-scoped GUI lifecycle
+   - kill_process_tree() for clean process cleanup
+   - VerificationLogger for structured test telemetry
+   - Artifact isolation (tests/artifacts/, tests/logs/)
+
+7. **Data-Oriented Design:**
+   - Minimal use of OOP, preference for data structures and functions
+   - Performance-oriented architecture (PerformanceMonitor, frame-sync loops)
+
+### Design Patterns
+
+1. **Observer Pattern:**
+   - EventEmitter for lifecycle events
+   - Listeners registered via events.on()
+
+2. **Strategy Pattern:**
+   - Multi-provider AI client with provider-specific methods
+   - _send_gemini(), _send_anthropic(), _send_deepseek(), _send_gemini_cli()
+
+3. **Factory Pattern:**
+   - generate_tickets() creates Ticket objects
+   - generate_tracks() creates Track objects
+   - multi_agent_conductor.run_worker_lifecycle() spawns worker context
+
+4.
**Command Pattern:**
+   - Action map in _process_pending_gui_tasks() for GUI state mutations
+   - Actions triggered by string keys (set_value, click, select_tab, etc.)
+
+5. **Dependency Injection:**
+   - confirm_and_run_callback, comms_log_callback, tool_log_callback injected by GUI
+   - qa_callback injected for Tier 4 QA
+
+6. **Template Method:**
+   - BaseSimulation for all simulation classes
+   - WorkflowSimulator extends BaseSimulation
+
+### Potential Concerns
+
+1. **Large Monolithic Files:**
+   - gui_2.py (77.6KB) - Main GUI orchestrator
+   - ai_client.py (70.6KB) - Multi-provider abstraction
+   - app_controller.py (70.1KB) - Headless controller
+   - mcp_client.py (48.2KB) - 26-tool dispatcher
+   - Risk: Difficult to navigate, high maintenance burden
+
+2. **State Duplication:**
+   - App (GUI) and AppController (headless) share logic
+   - Many similar methods between the two classes
+   - Violates DRY principle
+
+3. **Global State in Modules:**
+   - ai_client.py uses module-level globals for provider state
+   - Makes testing difficult, prone to state pollution
+   - Hard to reason about thread-safety
+
+4. **MCP Client Complexity:**
+   - 26 tools in a single file (mcp_client.py)
+   - Could be grouped by domain (file ops, Python analysis, web tools)
+   - Makes file navigation difficult
+
+5.
**Context Amnesia Enforcement:**
+   - Relies on manual ai_client.reset_session() calls
+   - No automation to prevent accidental state pollution
+   - Risk: Workers might accumulate state across tickets
+
+## Part 11: Cross-Reference with Existing Tracks
+
+### test_stabilization_20260302
+- Overlaps: None
+- This track addresses asyncio errors and the mock-rot ban
+- Our audit found: Mock-rot is already structurally banned but enforcement is weak
+- Synergy: This audit identifies specific weaknesses in the mock provider that the stabilization track should address
+
+### codebase_migration_20260302
+- Overlaps: None
+- This track restructures to a src/ layout
+- Our audit focuses on testing infrastructure, not directory structure
+- Synergy: Directory restructuring should happen AFTER testing is hardened
+
+### gui_decoupling_controller_20260302
+- Overlaps: None
+- This track extracts the state machine from the GUI
+- Our audit finds: State duplication between App and AppController
+- Synergy: Decoupling should include test infrastructure hardening
+
+### hook_api_ui_state_verification_20260302
+- Overlaps: None
+- This track adds a /api/gui/state GET endpoint
+- Our audit recommends: All tests should use the hook server for state verification
+- Synergy: High - this enables the automated testing our audit recommends
+
+### robust_json_parsing_tech_lead_20260302
+- Overlaps: None
+- This track adds auto-retry for JSON parsing
+- Our audit found: The mock provider never produces malformed JSON
+- Synergy: Auto-retry won't help if the mock always succeeds
+
+### concurrent_tier_source_tier_20260302
+- Overlaps: None
+- This track uses threading.local() for thread-safe logging
+- Our audit found: No concurrent access tests
+- Synergy: High - the threading.local() implementation should include comprehensive testing
+
+### test_suite_performance_and_flakiness_20260302
+- Overlaps: High
+- This track replaces time.sleep() with deterministic polling
+- Our audit identified: Arbitrary timeouts make tests brittle
+- Synergy: High - our audit recommends eliminating arbitrary sleeps
+
+### manual_ux_validation_20260302
+- Overlaps: None
+- This track validates GUI UX via simulation feedback
+- Our audit found: Simulations are low-fidelity emulators
+- Synergy: This track depends on the simulation framework being improved
+
+## Part 12: Recommendations for Future Tracks
+
+### Priority 1: Fix Mock Provider (HIGH)
+
+**Suggested Track:** "mock_provider_enhancement_20260305"
+
+**Goals:**
+- Add failure modes (timeouts, malformed JSON, rate limits)
+- Add input validation
+- Track tool calls for verification
+- Make mock responses configurable via environment variables
+
+**Files to Modify:**
+- tests/mock_gemini_cli.py
+
+### Priority 2: Fix Auto-Approval Pattern (HIGH)
+
+**Suggested Track:** "approval_ux_enhancement_20260305"
+
+**Goals:**
+- Remove auto-approval from critical path tests
+- Add dialog visibility verification before clicking
+- Add rejection flow tests for all approval types
+- Test approval fatigue scenarios
+
+**Files to Modify:**
+- tests/test_visual_sim_mma_v2.py
+- All simulation tests with approval flows
+
+### Priority 3: Add Negative Testing (HIGH)
+
+**Suggested Track:** "negative_path_testing_20260305"
+
+**Goals:**
+- Create a comprehensive negative test suite
+- Test all rejection flows
+- Test error handling (timeouts, network failures, malformed data)
+- Test concurrent access patterns
+- Test out-of-order event sequences
+
+**Files to Modify:**
+- Create tests/test_negative_flows.py
+- Update mock_gemini_cli.py
+
+### Priority 4: Add State Validation (MEDIUM)
+
+**Suggested Track:** "state_validation_enhancement_20260305"
+
+**Goals:**
+- Add schema validation using pydantic
+- Add state machine invariants testing
+- Add thread-safety tests for shared resources
+- Add DAG integrity tests (cycles, self-dependencies)
+
+**Files to Modify:**
+- Create tests/test_schemas.py
+- Update all
MMA and conductor tests
+
+### Priority 5: Add Visual Verification (MEDIUM)
+
+**Suggested Track:** "visual_regression_testing_20260305"
+
+**Goals:**
+- Add screenshot comparison infrastructure
+- Test modal dialog visibility
+- Test text overflow and clipping
+- Test layout at different window sizes
+
+**Files to Modify:**
+- Create tests/test_visual_regression.py
+- Create tests/baselines/
+- Update all visual and MMA tests
+
+### Priority 6: Improve Simulation Fidelity (LOW)
+
+**Suggested Track:** "simulation_fidelity_enhancement_20260305"
+
+**Goals:**
+- Add variable latency simulation
+- Add human-like behavior (typing speed, hesitation, mistakes)
+- Add realistic delays (not fixed random values)
+- Add task switching and distraction simulation
+
+**Files to Modify:**
+- simulation/user_agent.py
+- All simulation files
+
+### Priority 7: Consolidate Test Infrastructure (MEDIUM)
+
+**Suggested Track:** "test_infrastructure_consolidation_20260305"
+
+**Goals:**
+- Centralize common test patterns (polling helpers, verification helpers)
+- Improve fixture cleanup to prevent state pollution
+- Add better error reporting and diagnostics
+- Standardize mock patterns across all test suites
+
+**Files to Modify:**
+- tests/conftest.py
+- Create shared helpers in conftest.py
+- Update all tests to use the new helpers
+
+### Medium-Term Improvements
+
+#### Recommendation 8: Update Structural Testing Contract (MEDIUM)
+
+**Description:** Add missing rules to docs/guide_simulations.md
+
+**New Rules to Add:**
+1. Every approval dialog type must have tests for the rejection flow
+2. All async operations must have timeout/failure tests
+3. Every parser must have malformed input tests
+4. State validation beyond existence checks required
+5. Visual verification required for modal dialogs
+6. Thread-safety testing required for shared resources
+
+**Files to Modify:**
+- docs/guide_simulations.md
+
+#### Recommendation 9: Add Property-Based Testing (MEDIUM)
+
+**Description:** Use Hypothesis for generative testing
+
+**Files to Modify:**
+- tests/test_properties.py
+- Add to requirements: `hypothesis` package
+
+#### Recommendation 10: Add Fuzzing (MEDIUM)
+
+**Description:** Add fuzzing for robustness testing
+
+**Files to Modify:**
+- tests/test_fuzzing.py
+
+### Long-Term Architecture
+
+#### Recommendation 11: Adopt Test-Driven Development
+
+**Description:** Move the existing test suite toward TDD methodology
+
+**Current State:** Many tests written after implementation
+
+**Goal:** Write failing tests first, then implement to make them pass
+
+**Benefits:**
+- Ensures code quality
+- Improves test reliability
+- Makes refactoring safe
+
+#### Recommendation 12: Separate Unit and Integration Tests
+
+**Description:** Ensure unit tests don't depend on the live_gui fixture
+
+**Current Issue:** Many unit tests use live_gui, making them slow and flaky
+
+**Goal:** Isolate unit logic, use mocks for unit tests, reserve live_gui for true integration tests
+
+**Benefits:**
+- Faster unit test execution
+- Clear separation of concerns
+- More reliable unit test results
+
+---
+
+## Part 13: Conclusion
+
+### Summary of Findings
+
+This audit has revealed **6 critical false positive risks** and **6 major simulation fidelity gaps** that significantly undermine confidence in the test suite:
+
+**Critical False Positive Risks:**
+1. Mock provider always succeeds → error handling untested
+2. Auto-approval never verifies dialogs → approval UX untested
+3. Substring assertions only check existence → data integrity untested
+4. State existence only, no validation → invariants untested
+5. No negative testing → rejection flows, errors, edge cases untested
+6. No visual verification → rendering bugs never caught
+
+**Major Simulation Gaps:**
+1. No real-time latency simulation
+2.
No human-like behavior simulation
+3. Arbitrary polling intervals miss transient states
+4. Mock CLI redirection bypasses subprocess
+5. No stress testing
+
+**Impact Assessment:**
+- The test suite provides **good architectural boundary checking** but suffers from **critical gaps in error handling, state validation, and UX verification**
+- The simulation framework is a **rough emulator** that tests the happy path only and masks many real-world failure scenarios
+- Existing tracks (stabilization, migration, decoupling) will benefit from this audit's findings
+
+### Next Steps
+
+**Immediate:**
+1. Review this report with project stakeholders
+2. Prioritize recommendations based on severity and impact
+3. Plan implementation tracks for the highest priority issues
+
+**For Future Sessions:**
+1. Use this report as a reference when planning new tracks
+2. Ensure new tracks address the false positive risks identified here
+3. Improve the simulation framework before depending on it for critical tests
+
+---
+
+**Report Generation Complete**
+
+This report was generated by GLM-4.7 through comprehensive skeletal analysis of: the entire codebase (src/, tests/, simulation/), architecture documentation review, and pattern-based identification of testing anti-patterns and simulation gaps.
+ +The report contains 13 major sections covering: +- Full module architecture for src/ (26 modules analyzed) +- Complete test architecture for tests/ (100+ test files analyzed) +- Simulation framework analysis (9 scripts analyzed) +- Deep analysis of 7 critical false positive risks with concrete examples +- Deep analysis of 5 major simulation fidelity gaps +- Specific test file analysis with code examples +- 12 prioritized recommendations +- Architecture strengths, design patterns, and potential concerns +- Cross-reference with existing tracks +- Summary table by severity (7 HIGH priority issues, 10 medium priority issues) +- Long-term architecture recommendations +- Conclusion and next steps + +This exhaustive detail should enable future agents to fully understand findings without needing to re-analyze the codebase. diff --git a/conductor/tracks/test_architecture_integrity_audit_20260304/report_claude.md b/conductor/tracks/test_architecture_integrity_audit_20260304/report_claude.md new file mode 100644 index 0000000..4e9e0ae --- /dev/null +++ b/conductor/tracks/test_architecture_integrity_audit_20260304/report_claude.md @@ -0,0 +1,562 @@ +# Test Architecture Integrity Audit — Claude Review + +**Author:** Claude Sonnet 4.6 (Tier 1 Orchestrator) +**Review Date:** 2026-03-05 +**Source Report:** report.md (authored by GLM-4.7, 2026-03-04) +**Scope:** Verify GLM's findings, correct errors, surface missed issues, produce actionable +recommendations for downstream tracks. + +**Methodology:** +1. Read all 6 `docs/` architecture guides (guide_architecture, guide_simulations, guide_tools, + guide_mma, guide_meta_boundary, Readme) +2. Read GLM's full report.md +3. Read plan.md and spec.md for this track +4. Read py_get_skeleton for all 27 src/ modules +5. Read py_get_skeleton for conftest.py and representative test files + (test_extended_sims, test_live_gui_integration, test_dag_engine, + test_mma_orchestration_gui) +6. Read py_get_skeleton for all 9 simulation/ modules +7. 
Cross-referenced findings against JOURNAL.md, TASKS.md, and git history + +--- + +## Section 1: Verdict on GLM's Report + +GLM produced a competent surface-level audit. The structural inventory is +accurate and the broad categories of weakness (mock-rot, shallow assertions, +no negative paths) are valid. However, the report has material errors in +severity classification, contains two exact duplicate sections (Parts 10 and +11 are identical), and misses several issues that are more impactful than +the ones it flags at HIGH. It also makes recommendations that are +architecturally inappropriate for an ImGui immediate-mode application. + +**Confirmed correct:** ~60% of findings +**Overstated or miscategorized:** ~25% of findings +**Missed entirely:** see Section 3 + +--- + +## Section 2: GLM Findings — Confirmed, Corrected, or Rejected + +### 2.1 Confirmed: Mock Provider Never Fails (HIGH) + +GLM is correct. `tests/mock_gemini_cli.py` has zero failure modes. The +keyword routing (`'"PATH: Epic Initialization"'`, `'"PATH: Sprint Planning"'`, +default) always produces a well-formed success response. No test using this +mock can ever exercise: +- Malformed or truncated JSON-L output +- Non-zero exit code from the CLI process +- A `{"type": "result", "status": "error", ...}` result event +- Rate-limit or quota responses +- Partial output followed by process crash + +The `GeminiCliAdapter.send()` parses streaming JSON-L line-by-line. A +corrupted line (encoding error, mid-write crash) would throw a `json.JSONDecodeError` +that bubbles up through `_send_gemini_cli`. This path is entirely untested. + +**Severity: HIGH — confirmed.** + +### 2.2 Confirmed: Auto-Approval Hides Dialog Logic (MEDIUM, not HIGH) + +GLM flags this as HIGH. The auto-approval pattern in polling loops is: +```python +if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn') +``` + +This is structurally correct for automated testing — you MUST auto-approve +to drive the pipeline. 
The actual bug is different from what GLM describes:
+the tests never assert that the dialog appeared BEFORE approving. The
+correct pattern is:
+```python
+assert status.get('pending_mma_spawn_approval'), "Spawn dialog never appeared"
+client.click('btn_approve_spawn')
+```
+
+Without the assert, the test passes even if the dialog never fires (meaning
+spawn approval is silently bypassed at the application level).
+
+**Severity: MEDIUM (dialog verification gap, not approval mechanism itself).**
+**GLM's proposed fix ("Remove auto-approval") is wrong.** Auto-approval is
+required for unattended testing. The fix is to assert the flag is True
+*before* clicking.
+
+There is also zero testing of the rejection path: what happens when
+`btn_reject_spawn` is clicked? Does the engine stop? Does it log an error?
+Does the track reach "blocked" state? This is an untested state transition.
+
+### 2.3 Confirmed: Assertions Are Shallow (HIGH)
+
+GLM is correct. The two canonical examples from simulation tests:
+```python
+assert len(tickets) >= 2  # structure unknown
+assert "SUCCESS: Mock Tier 3 worker" in streams[tier3_key]  # substring only
+```
+
+Neither validates ticket schema, ID uniqueness, dependency correctness, or
+that the stream content is actually the full response and not a truncated
+fragment.
+
+**Severity: HIGH — confirmed.**
+
+### 2.4 Confirmed: No Negative Path Testing (HIGH)
+
+GLM is correct. The entire test suite covers only the happy path.
Missing: +- Rejection flows for all three dialog types (ConfirmDialog, MMAApprovalDialog, + MMASpawnApprovalDialog) +- Malformed LLM response handling (bad JSON, missing fields, unexpected types) +- Network timeout/connection error to Hook API during a live_gui test +- `shell_runner.run_powershell` timeout (60s) expiry path +- `mcp_client._resolve_and_check` returning an error (path outside allowlist) + +**Severity: HIGH — confirmed.** + +### 2.5 Confirmed: Arbitrary Poll Intervals Miss Transient States (MEDIUM) + +GLM is correct. 1-second polling in simulation loops will miss any state +that exists for less than 1 second. The approval dialogs in particular may +appear and be cleared within a single render frame if the engine is fast. + +The `WorkflowSimulator.wait_for_ai_response()` method is the most critical +polling target. It is the backbone of all extended simulation tests. If its +polling strategy is wrong, the entire extended sim suite is unreliable. + +**Severity: MEDIUM — confirmed.** + +### 2.6 Confirmed: Mock CLI Bypasses Real Subprocess Path (MEDIUM) + +GLM is correct. Setting `gcli_path` to a Python script does not exercise: +- Real PATH resolution for the `gemini` binary +- Windows process group creation (`CREATE_NEW_PROCESS_GROUP`) +- Environment variable propagation to the subprocess +- `mcp_env.toml` path prepending (in `shell_runner._build_subprocess_env`) +- The `kill_process_tree` teardown path when the process hangs + +**Severity: MEDIUM — confirmed.** + +### 2.7 CORRECTION: "run_powershell is a Read-Only Tool" + +**GLM is WRONG here.** In Part 8, GLM lists: +> "Read-Only Tools: run_powershell (via shell_runner.py)" + +`run_powershell` executes arbitrary PowerShell scripts against the filesystem. +It is the MOST dangerous tool in the set — it is not in `MUTATING_TOOLS` only +because it is not an MCP filesystem tool; its approval gate is the +`confirm_and_run_callback` (ConfirmDialog). 
Categorizing it as "read-only" +is a factual error that could mislead future workers about the security model. + +### 2.8 CORRECTION: "State Duplication Between App and AppController" + +**GLM is outdated here.** The gui_decoupling track (`1bc4205`) was completed +before this audit. `gui_2.App` now delegates all state through `AppController` +via `__getattr__`/`__setattr__` proxies. There is no duplication — `App` is a +thin ImGui rendering layer, `AppController` owns all state. GLM's concern is +stale relative to the current codebase. + +### 2.9 CORRECTION: "Priority 5 — Screenshot Comparison Infrastructure" + +**This recommendation is architecturally inappropriate** for Dear PyGui/ImGui. +These are immediate-mode renderers; there is no DOM or widget tree to +interrogate. Pixel-level screenshot comparison requires platform-specific +capture APIs (Windows Magnification, GDI) and is extremely fragile to font +rendering, DPI, and GPU differences. The Hook API's logical state verification +is the CORRECT and SUFFICIENT abstraction for this application. Adding +screenshot comparison would be high cost, low value, and high flakiness. + +The appropriate alternative (already partially in place via `hook_api_ui_state_verification_20260302`) +is exposing more GUI state via the Hook API so tests can assert logical +rendering state (is a panel visible? what is the modal title?) without pixels. + +### 2.10 CORRECTION: Severity Table Has Duplicate and Conflicting Entries + +The summary table in Part 9 lists identical items at multiple severity levels: +- "No concurrent access testing": appears as both HIGH and MEDIUM +- "No real-time latency simulation": appears as both MEDIUM and LOW +- "No human-like behavior": appears as both MEDIUM and LOW +- "Arbitrary polling intervals": appears as both MEDIUM and LOW + +Additionally, Parts 10 and 11 are EXACTLY IDENTICAL — the cross-reference +section was copy-pasted in full. 
This suggests the report was generated with +insufficient self-review. + +### 2.11 CONTEXTUAL DOWNGRADE: Human-Like Behavior / Latency Simulation + +GLM spends substantial space on the absence of: +- Typing speed simulation +- Hesitation before actions +- Variable LLM latency + +This is a **personal developer tool for a single user on a local machine**. +These are aspirational concerns for a production SaaS simulation framework. +For this product context, these are genuinely LOW priority. The simulation +framework's job is to verify that the GUI state machine transitions correctly, +not to simulate human psychology. + +--- + +## Section 3: Issues GLM Missed + +These are findings not present in GLM's report that carry meaningful risk. + +### 3.1 CRITICAL: `live_gui` is Session-Scoped — Dirty State Across Tests + +`conftest.py`'s `live_gui` fixture has `scope="session"`. This means ALL +tests that use `live_gui` share a single running GUI process. If test A +leaves the GUI in a state with an open modal dialog, test B will find the +GUI unresponsive or in an unexpected state. + +The teardown calls `client.reset_session()` (which clicks `btn_reset_session`), +but this clears AI state and discussion history, not pending dialogs or +MMA orchestration state. A test that triggers a spawn approval dialog and +then fails before approving it will leave `_pending_mma_spawn` set, blocking +the ENTIRE remaining test session. + +**Severity: HIGH.** The current test ordering dependency is invisible and +fragile. Tests must not be run in arbitrary order. + +**Fix:** Each `live_gui`-using test that touches MMA or approval flows should +explicitly verify clean state at start: +```python +status = client.get_mma_status() +assert not status.get('pending_mma_spawn_approval'), "Previous test left GUI dirty" +``` + +### 3.2 HIGH: `app_instance` Fixture Tests Don't Test Rendering + +The `app_instance` fixture mocks out all ImGui rendering. 
This means every +test using `app_instance` (approximately 40+ tests) is testing Python object +state, not rendered UI. Tests like: +- `test_app_has_render_token_budget_panel(app_instance)` — tests `hasattr()`, + not that the panel renders +- `test_render_token_budget_panel_empty_stats_no_crash(app_instance)` — calls + `_render_token_budget_panel()` in a context where all ImGui calls are no-ops + +This creates a systematic false-positive class: a method can be completely +broken (wrong data, missing widget calls) and the test passes because ImGui +calls are silently ignored. The only tests with genuine rendering fidelity +are the `live_gui` tests. + +This is the root cause behind GLM's "state existence only" finding. It is +not a test assertion weakness — it is a fixture architectural limitation. + +**Severity: HIGH.** The implication: all `app_instance`-based rendering +tests should be treated as "smoke tests that the method doesn't crash," +not as "verification that the rendering is correct." + +**Fix:** The `hook_api_ui_state_verification_20260302` track (adding +`/api/gui/state`) is the correct path forward: expose render-visible state +through the Hook API so `live_gui` tests can verify it. + +### 3.3 HIGH: No Test for `ConfirmDialog.wait()` Infinite Block + +`ConfirmDialog.wait()` uses `_condition.wait(timeout=0.1)` in a `while not self._done` loop. +There is no outer timeout on this loop. If the GUI thread never signals the +dialog (e.g., GUI crash after dialog creation, or a test that creates a +dialog but doesn't render it), the asyncio worker thread hangs indefinitely. + +This is particularly dangerous in the `run_worker_lifecycle` path: +1. Worker pushes dialog to event queue +2. GUI process crashes or freezes +3. `dialog.wait()` loops forever at 0.1s intervals +4. Test session hangs with no error output + +There is no test verifying that `wait()` has a maximum wait time and raises +an exception or returns a default (rejected) decision after it. 
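A bounded `wait()` — a sketch built only from this report's description of the inner loop (`_condition`, `_done`, the 0.1s poll), not taken from the project's actual `ConfirmDialog` — shows the contract such a test would pin down: the loop carries an outer deadline and falls back to a rejected decision instead of spinning forever.

```python
import threading
import time

class BoundedConfirmDialog:
    """Illustrative sketch, not the project's ConfirmDialog: same
    0.1s inner poll, but with an outer deadline on wait()."""

    def __init__(self) -> None:
        self._condition = threading.Condition()
        self._done = False
        self._approved = False

    def signal(self, approved: bool) -> None:
        # Called from the GUI thread when the user clicks a button.
        with self._condition:
            self._approved = approved
            self._done = True
            self._condition.notify_all()

    def wait(self, max_wait: float = 30.0) -> bool:
        # Bounded version of the `while not self._done` loop: after
        # max_wait seconds the decision defaults to rejected.
        deadline = time.monotonic() + max_wait
        with self._condition:
            while not self._done:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False  # timeout == rejection, never a hang
                self._condition.wait(timeout=min(0.1, remaining))
            return self._approved
```

The missing test then reduces to: construct the dialog, never signal it, and assert that `wait()` returns a rejected decision within the deadline instead of blocking.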
+ +**Severity: HIGH.** + +### 3.4 MEDIUM: `mcp_client` Module State Persists Across Unit Tests + +`mcp_client.configure()` sets module-level globals (`_allowed_paths`, +`_base_dirs`, `_primary_base_dir`). Tests that call MCP tool functions +directly without calling `configure()` first will use whatever state was +left from the previous test. The `reset_ai_client` autouse fixture calls +`ai_client.reset_session()` but does NOT reset `mcp_client` state. + +Any test that calls `mcp_client.read_file()`, `mcp_client.py_get_skeleton()`, +etc. directly (not through `ai_client.send()`) inherits the allowlist from +the previous test run. This can cause false passes (path permitted by +previous test's allowlist) or false failures (path denied because +`_base_dirs` is empty from a prior reset). + +**Severity: MEDIUM.** + +### 3.5 MEDIUM: `current_tier` Module Global — No Test for Concurrent Corruption + +GLM mentions this as a "design concern." It is more specific: the +`concurrent_tier_source_tier_20260302` track exists because `current_tier` +in `ai_client.py` is a module-level `str | None`. When two Tier 3 workers +run concurrently (future feature), the second `send()` call will overwrite +the first worker's tier tag. + +What's missing: there is no test that verifies the CURRENT behavior is safe +under single-threaded operation, and no test that demonstrates the failure +mode under concurrent operation to serve as a regression baseline for the fix. + +**Severity: MEDIUM.** + +### 3.6 MEDIUM: `test_arch_boundary_phase2.py` Tests Config File, Not Runtime + +The arch boundary tests verify that `manual_slop.toml` lists mutating tools +as disabled by default. But the tests don't verify: +1. That `manual_slop.toml` is actually loaded into `ai_client._agent_tools` + at startup +2. That `ai_client._agent_tools` is actually consulted before tool dispatch +3. That the TOML → runtime path is end-to-end + +A developer could modify how tools are loaded without breaking these tests. 
+The tests are static config audits, not runtime enforcement tests. + +**Severity: MEDIUM.** + +### 3.7 MEDIUM: `UserSimAgent.generate_response()` Calls `ai_client.send()` Directly + +From `simulation/user_agent.py`: the `UserSimAgent` class imports `ai_client` +and calls `ai_client.send()` to generate "human-like" responses. This means: +- Simulation tests have an implicit dependency on a configured LLM provider +- If run without an API key (e.g., in CI), simulations fail at the UserSimAgent + level, not at the GUI level — making failures hard to diagnose +- The mock gemini_cli setup in tests does NOT redirect `ai_client.send()` in + the TEST process (only in the GUI process via `gcli_path`), so UserSimAgent + would attempt real API calls + +No test documents whether UserSimAgent is actually exercised in the extended +sims (`test_extended_sims.py`) or whether those sims use the ApiHookClient +directly to drive the GUI. + +**Severity: MEDIUM.** + +### 3.8 LOW: Gemini CLI Tool-Call Protocol Not Exercised + +The real Gemini CLI emits `{"type": "tool_use", "tool": {...}}` events mid-stream +and then waits for `{"type": "tool_result", ...}` piped back on stdin. The +`mock_gemini_cli.py` does not emit any `tool_use` events; it only detects +`'"role": "tool"'` in the prompt to simulate a post-tool-call turn. + +This means `GeminiCliAdapter`'s tool-call parsing logic (the branch that +handles `tool_use` event types and accumulates them) is NEVER exercised by +any test. A regression in that parsing branch would be invisible to the +test suite. + +**Severity: LOW** (only relevant when the real gemini CLI is used with tools). + +### 3.9 LOW: `reset_ai_client` Autouse Fixture Timing is Wrong for Async Tests + +The `reset_ai_client` autouse fixture runs synchronously before each test. +For tests marked `@pytest.mark.asyncio`, the reset happens BEFORE the test's +async setup. 
If the async test itself triggers ai_client operations in setup +(e.g., through an event loop created by the fixture), the reset may not +capture all state mutations. This is an edge case but could explain +intermittent behavior in async tests. + +**Severity: LOW.** + +--- + +## Section 4: Revised Severity Matrix + +| Severity | Finding | GLM? | Source | +|---|---|---|---| +| **HIGH** | Mock provider has zero failure modes — all integration tests pass unconditionally | Confirmed | GLM | +| **HIGH** | `app_instance` fixture mocks ImGui — rendering tests are existence checks only | Missed | Claude | +| **HIGH** | `live_gui` session scope — dirty state from one test bleeds into the next | Missed | Claude | +| **HIGH** | `ConfirmDialog.wait()` has no outer timeout — worker thread can hang indefinitely | Missed | Claude | +| **HIGH** | Shallow assertions — substring match and length check only, no schema validation | Confirmed | GLM | +| **HIGH** | No negative path coverage — rejection flows, timeouts, malformed inputs untested | Confirmed | GLM | +| **MEDIUM** | Auto-approval never asserts dialog appeared before approving | Corrected | GLM/Claude | +| **MEDIUM** | `mcp_client` module state not reset between unit tests | Missed | Claude | +| **MEDIUM** | `current_tier` global — no test demonstrates safe single-thread or failure under concurrent use | Missed | Claude | +| **MEDIUM** | Arch boundary tests validate TOML config, not runtime enforcement | Missed | Claude | +| **MEDIUM** | `UserSimAgent` calls `ai_client.send()` directly — implicit real API dependency | Missed | Claude | +| **MEDIUM** | Arbitrary 1-second poll intervals miss sub-second transient states | Confirmed | GLM | +| **MEDIUM** | Mock CLI bypasses real subprocess spawning path | Confirmed | GLM | +| **LOW** | GeminiCliAdapter tool-use parsing branch never exercised by any test | Missed | Claude | +| **LOW** | `reset_ai_client` autouse timing may be incorrect for async tests | Missed | Claude | +| 
**LOW** | Variable latency / human-like simulation | Confirmed | GLM | + +--- + +## Section 5: Prioritized Recommendations for Downstream Tracks + +Listed in execution order, not importance order. Each maps to an existing or +proposed track. + +### Rec 1: Extend mock_gemini_cli with Failure Modes +**Target track:** New — `mock_provider_hardening_20260305` +**Files:** `tests/mock_gemini_cli.py` +**What:** Add a `MOCK_MODE` environment variable selector: +- `success` (current behavior, default) +- `malformed_json` — emit a truncated/corrupt JSON-L line +- `error_result` — emit `{"type": "result", "status": "error", ...}` +- `timeout` — sleep 90s to trigger the CLI timeout path +- `tool_use` — emit a real `tool_use` event to exercise GeminiCliAdapter parsing + +Tests that need to verify error handling pass `MOCK_MODE=error_result` via +`client.set_value()` before triggering the AI call. + +### Rec 2: Add Dialog Assertion Before Auto-Approval +**Target track:** `test_suite_performance_and_flakiness_20260302` (already planned) +**Files:** All live_gui simulation tests, `tests/test_visual_sim_mma_v2.py` +**What:** Replace the conditional approval pattern: +```python +# BAD (current): +if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn') +# GOOD: +assert status.get('pending_mma_spawn_approval'), "Spawn dialog must appear before approve" +client.click('btn_approve_spawn') +``` +Also add at least one test per dialog type that clicks reject and asserts the +correct downstream state (engine marks track blocked, no worker spawned, etc.). 
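The rejection test Rec 2 calls for could take the following shape. `FakeHookClient` below is a stand-in used only to make the assertion pattern concrete — the real test would use `ApiHookClient` against the live GUI — and the `'blocked'` terminal status is an assumption about the engine's reject semantics, not a documented contract.

```python
class FakeHookClient:
    """Minimal stand-in for ApiHookClient; only illustrates the
    assertion shape. The real client talks HTTP to the live GUI."""

    def __init__(self) -> None:
        self._status = {
            'pending_mma_spawn_approval': True,
            'mma_status': 'awaiting_approval',
        }

    def get_mma_status(self) -> dict:
        return dict(self._status)

    def click(self, widget_id: str) -> None:
        if widget_id == 'btn_reject_spawn':
            # Assumed semantics: rejecting clears the dialog and
            # leaves the track in a 'blocked' state.
            self._status = {
                'pending_mma_spawn_approval': False,
                'mma_status': 'blocked',
            }

def reject_path_test(client) -> None:
    status = client.get_mma_status()
    assert status.get('pending_mma_spawn_approval'), "Spawn dialog must appear before reject"
    client.click('btn_reject_spawn')
    after = client.get_mma_status()
    assert not after.get('pending_mma_spawn_approval'), "Dialog must clear after reject"
    assert after.get('mma_status') == 'blocked', "Track must not proceed after reject"

reject_path_test(FakeHookClient())
```

Note the same assert-before-act discipline as the approval fix: the test fails loudly if the dialog never appeared, rather than silently skipping the click.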
+
+### Rec 3: Fix live_gui Session Scope Dirty State
+**Target track:** `test_suite_performance_and_flakiness_20260302`
+**Files:** `tests/conftest.py`
+**What:** Add a per-test autouse fixture (function-scoped) that asserts clean
+GUI state before each `live_gui` test. Guard it so it only activates for tests
+that already use `live_gui` — an unconditional autouse fixture requesting
+`live_gui` from the top-level conftest would force the session GUI to start
+for every unit test as well:
+```python
+@pytest.fixture(autouse=True)
+def assert_gui_clean(live_gui):
+    client = ApiHookClient()
+    status = client.get_mma_status()
+    assert not status.get('pending_mma_spawn_approval')
+    assert not status.get('pending_mma_step_approval')
+    assert not status.get('pending_tool_approval')
+    assert status.get('mma_status') in ('idle', 'done', '')
+```
+This surfaces inter-test pollution immediately rather than causing a
+mysterious hang in a later test.
+
+### Rec 4: Add ConfirmDialog Timeout Test
+**Target track:** New — `mock_provider_hardening_20260305` (or `test_stabilization`)
+**Files:** `tests/test_conductor_engine.py`
+**What:** Add a test that creates a `ConfirmDialog`, never signals it, and
+verifies after N seconds that the background thread does NOT block indefinitely.
+This requires either a hard timeout on `wait()` or a documented contract that
+callers must signal the dialog within a finite window.
+
+### Rec 5: Expose More State via Hook API
+**Target track:** `hook_api_ui_state_verification_20260302` (already planned, HIGH priority)
+**Files:** `src/api_hooks.py`
+**What:** This track is the key enabler for replacing `app_instance` rendering
+tests with genuine state verification. The planned `/api/gui/state` endpoint
+should expose:
+- Active modal type (`confirm_dialog`, `mma_step_approval`, `mma_spawn_approval`, `ask`, `none`)
+- `ui_focus_agent` current filter value
+- `_mma_status`, `_ai_status` text values
+- Panel visibility flags
+
+Once this is in place, the `app_instance` rendering tests can be migrated
+to `live_gui` equivalents that actually verify GUI-visible state.
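One possible response shape for the planned `/api/gui/state` endpoint, with the schema-level checks a `live_gui` test could run against it. Only the four bulleted fields come from Rec 5; the key names, the `panel_visibility` map, and the example values are assumptions for illustration.

```python
# Hypothetical /api/gui/state payload shape; field names beyond the
# four bullets in Rec 5 are illustrative assumptions.
ALLOWED_MODALS = {'confirm_dialog', 'mma_step_approval', 'mma_spawn_approval', 'ask', 'none'}

def validate_gui_state(payload: dict) -> dict:
    """Schema-level assertions a live_gui test could apply to the payload."""
    assert payload['active_modal'] in ALLOWED_MODALS
    assert isinstance(payload['ui_focus_agent'], str)
    assert isinstance(payload['mma_status'], str)
    assert isinstance(payload['ai_status'], str)
    assert all(isinstance(v, bool) for v in payload['panel_visibility'].values())
    return payload

example = validate_gui_state({
    'active_modal': 'none',
    'ui_focus_agent': '',
    'mma_status': 'idle',
    'ai_status': 'Ready',
    'panel_visibility': {'token_budget': True, 'discussion': True},
})
```

Validating against a fixed schema like this is what lifts the migrated tests above the substring-existence assertions criticized in Section 2.3.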
+ +### Rec 6: Add mcp_client Reset to autouse Fixture +**Target track:** `test_suite_performance_and_flakiness_20260302` +**Files:** `tests/conftest.py` +**What:** Extend `reset_ai_client` autouse fixture to also call +`mcp_client.configure([], [])` to clear the allowlist between tests. +This prevents allowlist state from a previous test from leaking into the next. + +### Rec 7: Add Runtime HITL Enforcement Test +**Target track:** `test_suite_performance_and_flakiness_20260302` or new +**Files:** `tests/test_arch_boundary_phase2.py` +**What:** Add an integration test (using `app_instance`) that: +1. Calls `ai_client.set_agent_tools({'set_file_slice': True})` +2. Confirms `mcp_client.MUTATING_TOOLS` contains `'set_file_slice'` +3. Triggers a dispatch of `set_file_slice` +4. Verifies `pre_tool_callback` was invoked BEFORE the write occurred + +This closes the gap between "config says mutating tools are off" and +"runtime actually gates them through the approval callback." + +### Rec 8: Document `app_instance` Limitation in conftest +**Target track:** Any ongoing work — immediate, no track needed +**Files:** `tests/conftest.py` +**What:** Add a docstring to `app_instance` fixture: +```python +""" +App instance with all ImGui rendering calls mocked to no-ops. +Use for unit tests of state logic and method existence. +DO NOT use to verify rendering correctness — use live_gui for that. +""" +``` +This prevents future workers from writing rendering tests against this fixture +and believing they have real coverage. + +--- + +## Section 6: What the Existing Track Queue Gets Right + +The `TASKS.md` strict execution queue is well-ordered for the test concerns: + +1. `test_stabilization_20260302` → Must be first: asyncio lifecycle, mock-rot ban +2. `strict_static_analysis_and_typing_20260302` → Type safety before refactoring +3. `codebase_migration_20260302` → Already complete (commit 270f5f7) +4. `gui_decoupling_controller_20260302` → Already complete (commit 1bc4205) +5. 
`hook_api_ui_state_verification_20260302` → Critical enabler for real rendering tests +6. `robust_json_parsing_tech_lead_20260302` → Valid, but NOTE: the mock never produces + malformed JSON, so the auto-retry loop cannot be verified without Rec 1 above +7. `concurrent_tier_source_tier_20260302` → Threading safety for future parallel workers +8. `test_suite_performance_and_flakiness_20260302` → Polling determinism, sleep elimination + +The `test_architecture_integrity_audit_20260304` (this track) sits logically +between #1 and #5 — it provides the analytical basis for what #5 and #8 need +to fix. The audit output (this document) should be read by the Tier 2 Tech Lead +for both those tracks. + +The proposed new tracks (mock_provider_hardening, negative_path_testing) from +GLM's recommendations are valid but should be created AFTER track #5 +(`hook_api_ui_state_verification`) is complete, since they depend on the +richer Hook API state to write meaningful assertions. + +--- + +## Section 7: Architectural Observations Not in GLM's Report + +### The Two-Tier Mock Problem + +The test suite has two completely separate mock layers that do not know about +each other: + +**Layer 1** — `app_instance` fixture (in-process): Patches `immapp.run()`, +`ai_client.send()`, and related functions with `unittest.mock`. Tests call +methods directly. No network, no subprocess, no real threading. + +**Layer 2** — `mock_gemini_cli.py` (out-of-process): A fake subprocess that +the live GUI process calls through its own internal LLM pipeline. Tests drive +this via `ApiHookClient` HTTP calls to the running GUI process. + +These layers test completely different things. Layer 1 tests Python object +invariants. Layer 2 tests the full application pipeline (threading, HTTP, IPC, +process management). Most of the test suite is Layer 1. Very few tests are +Layer 2. The high-value tests are Layer 2 because they exercise the actual +system, not a mock of it. 
+ +GLM correctly identifies that Layer 1 tests are of limited value for +rendering verification but does not frame it as a two-layer architecture +problem with a clear solution (expand Layer 2 via hook_api_ui_state_verification). + +### The Simulation Framework's Actual Role + +The `simulation/` module is not (and should not be) a fidelity benchmark. +Its role is: +1. Drive the GUI through a sequence of interactions +2. Verify the GUI reaches expected states after each interaction + +The simulations (`sim_context.py`, `sim_ai_settings.py`, `sim_tools.py`, +`sim_execution.py`) are extremely thin wrappers. Their actual test value +comes from `test_extended_sims.py` which calls them against a live GUI and +verifies no exceptions are thrown. This is essentially a smoke test for the +GUI lifecycle, not a behavioral verification. + +The real behavioral verification is in `test_visual_sim_mma_v2.py` and +similar files that assert specific state transitions. The simulation/ +module should be understood as "workflow drivers," not "verification modules." + +GLM's recommendation to add latency simulation and human-like behavior to +`simulation/user_agent.py` would add complexity to a layer that isn't the +bottleneck. The bottleneck is assertion depth in the polling loops, not +realism of the user actions. + +--- + +*End of report. 
Next action: Tier 2 Tech Lead to read this alongside
+`plan.md` and initiate track #5 (`hook_api_ui_state_verification_20260302`)
+as the highest-leverage unblocking action.*
diff --git a/conductor/tracks/test_architecture_integrity_audit_20260304/spec.md b/conductor/tracks/test_architecture_integrity_audit_20260304/spec.md
new file mode 100644
index 0000000..1a43afe
--- /dev/null
+++ b/conductor/tracks/test_architecture_integrity_audit_20260304/spec.md
@@ -0,0 +1,96 @@
+# Track Specification: Test Architecture Integrity & Simulation Audit
+
+## Overview
+Comprehensive audit of testing infrastructure and simulation framework to identify false positive risks, coverage gaps, and simulation fidelity issues. This analysis was triggered by a request to review how tests and simulations are set up, whether tests can report passing grades when they fail, and if simulations are rigorous enough or are just rough emulators.
+
+## Current State Audit (as of 20260304)
+
+### Already Implemented (DO NOT re-implement)
+- **Testing Infrastructure** (tests/conftest.py):
+  - live_gui fixture for session-scoped GUI lifecycle management
+  - Process cleanup with kill_process_tree()
+  - VerificationLogger for diagnostic logging
+  - Artifact isolation to tests/artifacts/ and tests/logs/
+  - Ban on arbitrary core mocking
+
+- **Simulation Framework** (simulation/):
+  - sim_base.py: Base simulation class with setup/teardown
+  - workflow_sim.py: Workflow orchestration
+  - sim_context.py, sim_ai_settings.py, sim_tools.py, sim_execution.py
+  - user_agent.py: Simulated human agent
+
+- **Mock Provider** (tests/mock_gemini_cli.py):
+  - Keyword-based response routing
+  - JSON-L protocol matching real CLI output
+
+#### Critical False
Positive Risks Identified +1. **Mock Provider Always Returns Success**: Never validates input, never produces errors, never tests failure paths +2. **Auto-Approval Pattern**: All HITL gates auto-clicked, never verifying dialogs appear or rejection flows +3. **Substring-Based Assertions**: Only check existence of content, not validity or structure +4. **State Existence Only**: Tests check fields exist but not their correctness or invariants +5. **No Negative Path Testing**: No coverage for rejection, timeout, malformed input, concurrent access +6. **No Visual Verification**: Tests verify logical state via Hook API but never check what's actually rendered +7. **No State Machine Validation**: No verification that status transitions are legal or complete + +#### Simulation Rigor Gaps Identified +1. **No Real-Time Latency Simulation**: Fixed delays don't model variable LLM/network latency +2. **No Human-Like Behavior**: Instant actions, no typing speed, hesitation, mistakes, or task switching +3. **Arbitrary Polling Intervals**: 1-second polls may miss transient states +4. **Mock CLI Redirection**: Bypasses subprocess spawning, environment passing, and process cleanup paths +5. **No Stress Testing**: No load testing, no edge case bombardment + +#### Test Coverage Gaps +- No tests for approval dialog rejection flows +- No tests for malformed LLM response handling +- No tests for network timeout/failure scenarios +- No tests for concurrent duplicate requests +- No tests for out-of-order event sequences +- No thread-safety tests for shared resources +- No visual rendering verification (modal visibility, text overflow, color schemes) + +#### Structural Testing Contract Gaps +- Missing rule requiring negative path testing +- Missing rule requiring state validation beyond existence +- Missing rule requiring visual verification +- No enforcement for thread-safety testing + +## Goals + +1. Document all identified testing pitfalls with severity ratings (HIGH/MEDIUM/LOW) +2. 
Create actionable recommendations for each identified issue +3. Map existing test coverage gaps to specific missing test files +4. Provide architecture recommendations for simulation framework enhancements + +## Functional Requirements + +- [ ] Document all false positive risks in a structured format +- [ ] Document all simulation fidelity gaps in a structured format +- [ ] Create severity matrix for each issue +- [ ] Generate list of missing test cases by category +- [ ] Provide concrete examples of how current tests would pass despite bugs +- [ ] Provide concrete examples of how simulations would miss UX issues + +## Non-Functional Requirements + +- Report must include author attribution (GLM-4.7) and derivation methodology +- Analysis must cite specific file paths and line numbers where applicable +- Recommendations must be prioritized by impact and implementation effort + +## Architecture Reference + +Refer to: +- docs/guide_simulations.md - Current simulation contract and patterns +- docs/guide_mma.md - MMA orchestration architecture +- docs/guide_architecture.md - Thread domains, event system, HITL mechanism +- conductor/tracks/*/spec.md - Existing track specifications for consistency + +## Out of Scope + +- Implementing the actual test fixes (that's for subsequent tracks) +- Refactoring the simulation framework (documenting only) +- Modifying the mock provider (analyzing only) +- Writing new tests (planning phase for future tracks)