From 983538aa8b5b83fdc452cccfcdec537d454460b6 Mon Sep 17 00:00:00 2001 From: Ed_ Date: Thu, 5 Mar 2026 00:31:55 -0500 Subject: [PATCH] reports and potential new track --- TASKS.md | 6 +- .../index.md | 3 + .../metadata.json | 9 + .../plan.md | 33 + .../report.md | 2303 +++++++++++++++++ .../report_claude.md | 562 ++++ .../spec.md | 96 + 7 files changed, 3011 insertions(+), 1 deletion(-) create mode 100644 conductor/tracks/test_architecture_integrity_audit_20260304/index.md create mode 100644 conductor/tracks/test_architecture_integrity_audit_20260304/metadata.json create mode 100644 conductor/tracks/test_architecture_integrity_audit_20260304/plan.md create mode 100644 conductor/tracks/test_architecture_integrity_audit_20260304/report.md create mode 100644 conductor/tracks/test_architecture_integrity_audit_20260304/report_claude.md create mode 100644 conductor/tracks/test_architecture_integrity_audit_20260304/spec.md diff --git a/TASKS.md b/TASKS.md index 5cab1a6..7d40b2f 100644 --- a/TASKS.md +++ b/TASKS.md @@ -79,4 +79,8 @@ **Goal:** Elevate Tier 4 from a log summarizer to an auto-patcher. When a verification test fails, Tier 4 generates a `.patch` file. The GUI intercepts this and presents a side-by-side Diff Viewer. The user clicks "Apply Patch" to instantly resume the pipeline. ### 5. Transitioning to a Native Orchestrator -**Goal:** Absorb the Conductor extension entirely into the core application. Manual Slop should natively read/write `plan.md`, manage the `metadata.json`, and orchestrate the MMA tiers in pure Python, removing the dependency on external CLI shell executions (`mma_exec.py`). \ No newline at end of file +**Goal:** Absorb the Conductor extension entirely into the core application. Manual Slop should natively read/write `plan.md`, manage the `metadata.json`, and orchestrate the MMA tiers in pure Python, removing the dependency on external CLI shell executions (`mma_exec.py`). +### 10. 
test_architecture_integrity_audit_20260304 (Planned) +- **Status:** Initialized +- **Priority:** High +- **Goal:** Comprehensive audit of testing infrastructure and simulation framework to identify false positive risks, coverage gaps, and simulation fidelity issues. Documented by GLM-4.7 via full skeletal analysis of src/, tests/, and simulation/ directories. diff --git a/conductor/tracks/test_architecture_integrity_audit_20260304/index.md b/conductor/tracks/test_architecture_integrity_audit_20260304/index.md new file mode 100644 index 0000000..3612454 --- /dev/null +++ b/conductor/tracks/test_architecture_integrity_audit_20260304/index.md @@ -0,0 +1,3 @@ +# Test Architecture Integrity & Simulation Audit + +[Specification](spec.md) | [Plan](plan.md) diff --git a/conductor/tracks/test_architecture_integrity_audit_20260304/metadata.json b/conductor/tracks/test_architecture_integrity_audit_20260304/metadata.json new file mode 100644 index 0000000..d86b79b --- /dev/null +++ b/conductor/tracks/test_architecture_integrity_audit_20260304/metadata.json @@ -0,0 +1,9 @@ +{ + "id": "test_architecture_integrity_audit_20260304", + "name": "Test Architecture Integrity & Simulation Audit", + "status": "planned", + "created_at": "2026-03-04T00:00:00Z", + "updated_at": "2026-03-04T00:00:00Z", + "type": "audit", + "severity": "high" +} diff --git a/conductor/tracks/test_architecture_integrity_audit_20260304/plan.md b/conductor/tracks/test_architecture_integrity_audit_20260304/plan.md new file mode 100644 index 0000000..22707b0 --- /dev/null +++ b/conductor/tracks/test_architecture_integrity_audit_20260304/plan.md @@ -0,0 +1,33 @@ +# Implementation Plan + +## Phase 1: Documentation (Planning) +Focus: Create comprehensive audit documentation with severity ratings + +- [ ] Task 1.1: Document all identified false positive risks with severity matrix +- [ ] Task 1.2: Document all simulation fidelity gaps with impact analysis +- [ ] Task 1.3: Create mapping of coverage gaps to test
categories +- [ ] Task 1.4: Provide concrete false positive examples +- [ ] Task 1.5: Provide concrete simulation miss examples +- [ ] Task 1.6: Prioritize recommendations by impact/effort matrix + +## Phase 2: Review & Validation (Research) +Focus: Peer review of audit findings + +- [ ] Task 2.1: Review existing tracks for overlap with this audit +- [ ] Task 2.2: Validate severity ratings against actual bug history +- [ ] Task 2.3: Cross-reference findings with docs/guide_simulations.md contract +- [ ] Task 2.4: Identify which gaps should be addressed in which future track + +## Phase 3: Track Finalization +Focus: Prepare for downstream implementation tracks + +- [ ] Task 3.1: Create prioritized backlog of implementation recommendations +- [ ] Task 3.2: Map recommendations to appropriate future tracks +- [ ] Task 3.3: Document dependencies between this audit and subsequent work + +## Phase 4: User Manual Verification (Protocol in workflow.md) +Focus: Human review of audit findings + +- [ ] Task 4.1: Review severity matrix for accuracy +- [ ] Task 4.2: Validate concrete examples against real-world scenarios +- [ ] Task 4.3: Approve recommendations for implementation diff --git a/conductor/tracks/test_architecture_integrity_audit_20260304/report.md b/conductor/tracks/test_architecture_integrity_audit_20260304/report.md new file mode 100644 index 0000000..a45a3f6 --- /dev/null +++ b/conductor/tracks/test_architecture_integrity_audit_20260304/report.md @@ -0,0 +1,2303 @@ +# Manual Slop Testing & Simulation Architecture Analysis + +**Author:** GLM-4.7 + +**Analysis Date:** 2026-03-04 + +**Derivation Methodology:** +1. Performed full skeletal summary of all Python files in `./src/` (28 modules, ~400KB total) +2. Performed full skeletal summary of all Python test files in `./tests/` (100+ test files) +3. Performed full skeletal summary of all simulation files in `./simulation/` (9 scripts) +4. 
Read all architecture documentation in `./docs/` (guide_simulations.md, guide_mma.md, guide_architecture.md) 5. Analyzed test infrastructure patterns against structural testing contract requirements 6. Identified gaps between contract requirements and actual implementation 7. Mapped potential false positive scenarios across mock provider, auto-approval, and assertion patterns 8. Evaluated simulation framework fidelity against real UX requirements 9. Cross-referenced findings with existing tracks in `conductor/tracks/` + +--- + +## Executive Summary + +**Critical Finding:** The test suite has significant **false positive risks** and **simulation gaps** that could mask bugs and UX issues. The current approach is a **low-resolution emulator** rather than a high-fidelity puppet of real user experience. + +**Key Issues Identified:** +1. Mock provider always returns success → tests pass even if real LLM would fail +2. Auto-approval of all HITL gates → tests never verify approval UX works +3. Substring-based assertions → tests pass even if output is malformed +4. No state validation → tests check existence, not correctness +5. No negative path testing → error handling never verified +6. No visual verification → UI rendering bugs never caught + +--- + +## Part 1: Module Architecture Analysis (src/) +### Complete File Inventory and Functionality + +**Core Infrastructure (3 modules, ~23KB)** + +#### 1. 
events.py (2.6KB) +**Purpose:** Decoupled event system for cross-module communication + +**Classes:** +- `EventEmitter`: Synchronous pub/sub pattern + - `on(event_name, callback)`: Register callback for event + - `emit(event_name, *args, **kwargs)`: Execute all callbacks + - No thread safety - assumes single-threaded usage + +- `AsyncEventQueue`: Async queue-based communication + - `put(event_name, payload)`: Enqueue event + - `get()`: Retrieve event tuple (event_name, payload) + - Uses `asyncio.Queue` internally + +- `UserRequestEvent`: Typed payload for AI requests + - Fields: prompt, stable_md, file_items, disc_text, base_dir + - `to_dict()`: Serialize to dictionary format + +**Usage Pattern:** Used by ai_client for lifecycle hooks, by App for event routing, by multi_agent_conductor for state broadcasting. + +#### 2. models.py (6.9KB) +**Purpose:** Core data structures for MMA orchestration + +**Data Classes:** +- `Ticket`: Atomic unit of work + - Fields: id, description, status (todo|in_progress|completed|blocked), assigned_to, target_file, context_requirements, depends_on, blocked_reason, step_mode, retry_count + - Methods: `mark_blocked(reason)`, `mark_complete()`, `get(key, default)`, `to_dict()`, `from_dict(data)` + +- `Track`: Collection of tickets with shared goal + - Fields: id, description, tickets list + - Methods: `get_executable_tickets()` - returns todo tickets with all deps completed + +- `WorkerContext`: Context for Tier 3 agents + - Fields: ticket_id, model_name, messages list + +- `TrackState`: Persistence schema for track state + - Fields: metadata (Metadata object), discussion list, tasks list (Ticket objects) + - Methods: `to_dict()`, `from_dict(data)` + +- `Metadata`: Track metadata + - Fields: id, name, status, created_at, updated_at + - Methods: `to_dict()`, `from_dict(data)` + +**Constants:** +- `DISC_ROLES`: ["User", "AI", "Vendor API", "System", "Reasoning"] +- `AGENT_TOOL_NAMES`: List of 26 MCP tool names +- `CONFIG_PATH`: Path to 
config.toml + +**Usage Pattern:** Used throughout MMA system for state management and persistence. + +#### 3. api_hook_client.py (9.2KB) +**Purpose:** IPC client for hook server communication + +**Class: `ApiHookClient`** +- `__init__(base_url, max_retries, retry_delay)`: Initialize client +- `wait_for_server(timeout)`: Poll /status until ready +- `_make_request(method, endpoint, data, timeout)`: Internal request wrapper with retry logic +- `get_status()`: Check health of hook server +- `get_project()`: Retrieve project data +- `post_project(project_data)`: Update project data +- `get_session()`: Retrieve session data +- `get_mma_status()`: Retrieve current MMA status (track, tickets, tier, streams) +- `push_event(event_type, payload)`: Push event to GUI's AsyncEventQueue +- `get_performance()`: Retrieve UI performance metrics +- `post_session(session_entries)`: Update session data +- `post_gui(gui_data)`: Update GUI state +- `select_tab(tab_bar, tab)`: Switch to specific tab +- `select_list_item(listbox, item_value)`: Select item in listbox +- `set_value(item, value)`: Set GUI field value +- `get_value(item)`: Get GUI field value +- `get_text_value(item_tag)`: Get string representation of field +- `get_node_status(node_tag)`: Get DAG node status +- `click(item, *args, **kwargs)`: Simulate button click +- `get_indicator_state(tag)`: Check indicator visibility +- `get_events()`: Fetch and clear event queue +- `wait_for_event(event_type, timeout)`: Poll for specific event +- `wait_for_value(item, expected, timeout)`: Poll until value matches +- `reset_session()`: Simulate clicking Reset Session +- `request_confirmation(tool_name, args)`: Blocking approval request + +**Usage Pattern:** Primary interface for test automation via Hook API. + +#### 4. 
api_hooks.py (14.9KB) +**Purpose:** HTTP server for exposing internal state to external automation + +**Classes:** +- `HookServerInstance`: Custom ThreadingHTTPServer carrying App reference + - Manages server lifecycle + - Delegates to HookHandler for request processing + +- `HookHandler(BaseHTTPRequestHandler)`: Handles HTTP requests + - `do_GET()`: Handle GET requests + - `/status`: Return server health and basic state + - `/project`: Return project configuration + - `/session`: Return session data + - `/gui/state`: Return full GUI state (for test automation) + - `/diagnostics`: Return performance metrics + - `do_POST()`: Handle POST requests + - `/project`: Update project configuration + - `/session`: Update session data + - `/api/ask`: Trigger AI request with approval + - `/gui`: Execute GUI actions (set_value, click, select_tab, etc.) + - `/resolve_pending_action`: Resolve approval dialogs + +**Usage Pattern:** Provides REST API for headless mode and test automation. + +#### 5. performance_monitor.py (4.2KB) +**Purpose:** Telemetry for UI performance monitoring + +**Class: `PerformanceMonitor`** +- `__init__()`: Initialize metrics and start CPU monitoring thread +- `_monitor_cpu()`: Background thread sampling CPU usage every 1s +- `start_frame()`: Record frame start time +- `record_input_event()`: Track time since last input event +- `start_component(name)`: Start timing a UI component +- `end_component(name)`: End timing a UI component +- `end_frame()`: Calculate FPS and frame time +- `_check_alerts()`: Check for performance degradation (low FPS, high frame time, high input lag) +- `get_metrics()`: Return dict with fps, frame_time_ms, cpu_percent, input_lag_ms +- `stop()`: Stop monitoring thread + +**Usage Pattern:** Integrated into App for real-time performance tracking. + +### AI Integration Layer (2 modules, ~75KB) + +#### 6. 
ai_client.py (70.6KB) +**Purpose:** Multi-provider LLM abstraction + +**Module-Level State:** +- `_provider`: "gemini" | "anthropic" | "deepseek" | "gemini_cli" +- `_model`: Current model name +- `_temperature`, `_max_tokens`: Model parameters +- `_history_trunc_limit`: Character limit for history truncation (8000) +- `events`: EventEmitter for lifecycle hooks +- `_send_lock`: threading.Lock to serialize send() calls +- `MAX_TOOL_ROUNDS`: Maximum tool call loop iterations (10) +- `_MAX_TOOL_OUTPUT_BYTES`: Cumulative tool output budget (500KB) +- `_ANTHROPIC_CHUNK_SIZE`: Max chars per text block (120,000) +- `_ANTHROPIC_MAX_PROMPT_TOKENS`: Anthropic limit (180,000) +- `_GEMINI_MAX_INPUT_TOKENS`: Gemini limit (900,000) +- `_GEMINI_CACHE_TTL`: Cache TTL in seconds (3600, rebuilt at 90%) + +**Per-Provider Clients:** +- `_gemini_client`: genai.Client (SDK-managed stateful chat) +- `_gemini_chat`: Holds history internally +- `_gemini_cache`: Server-side CachedContent +- `_gemini_cache_md_hash`: Hash for cache invalidation +- `_gemini_cache_created_at`: Cache creation timestamp +- `_anthropic_client`: anthropic.Anthropic (client-managed history) +- `_anthropic_history`: List of message dicts (client-managed) +- `_anthropic_history_lock`: threading.Lock +- `_deepseek_client`: Raw requests HTTP client +- `_deepseek_history`: List of message dicts (client-managed) +- `_deepseek_history_lock`: threading.Lock +- `_gemini_cli_adapter`: GeminiCliAdapter (subprocess wrapper) + +**Callback Injections:** +- `confirm_and_run_callback`: Set by gui.py/app_controller.py - called when AI wants to run command +- `comms_log_callback`: Set by gui.py/app_controller.py - called when comms entry appended +- `tool_log_callback`: Set by gui.py/app_controller.py - called when tool call completes +- `current_tier`: Set by caller tiers - used for comms tagging + +**Key Functions:** +- `set_model_params(temp, max_tok, trunc_limit)`: Update model parameters +- `get_history_trunc_limit()`, 
`set_history_trunc_limit(val)`: Get/set trunc limit +- `cleanup()`: Destroy all provider clients +- `reset_session()`: Reset all provider state +- `get_gemini_cache_stats()`: Return cache stats dict +- `list_models(provider)`: List available models for provider +- `set_provider(provider, model)`: Switch provider and model +- `get_provider()`: Return current provider +- `set_agent_tools(tools)`: Set enabled tools +- `_build_anthropic_tools()`: Build tool schemas for Anthropic +- `_get_anthropic_tools()`: Get cached Anthropic tools +- `_gemini_tool_declaration()`: Build tool declaration for Gemini +- `_run_script(script, base_dir, qa_callback)`: Execute PowerShell via shell_runner +- `_truncate_tool_output(output)`: Truncate tool output at char limit +- `_reread_file_items(file_items)`: Check mtimes and rebuild changed file diffs +- `_build_file_context_text(file_items)`: Build context text from file items +- `_build_file_diff_text(changed_items)`: Build diff text for changed files +- `_content_block_to_dict(block)`: Convert content block to dict +- `_ensure_gemini_client()`: Initialize Gemini client if needed +- `_get_gemini_history_list(chat)`: Extract history list from chat +- `_send_gemini(md_content, user_message, base_dir, file_items, discussion_history, pre_tool_callback, qa_callback, enable_tools, stream_callback)`: Send via Gemini SDK +- `_send_gemini_cli(md_content, user_message, base_dir, file_items, discussion_history, pre_tool_callback, qa_callback, enable_tools, stream_callback)`: Send via Gemini CLI adapter +- `_send_anthropic(md_content, user_message, base_dir, file_items, discussion_history, pre_tool_callback, qa_callback, enable_tools, stream_callback)`: Send via Anthropic SDK +- `_estimate_message_tokens(msg)`: Estimate token count for message +- `_invalidate_token_estimate(msg)`: Clear token estimate cache +- `_estimate_prompt_tokens(system_blocks, history)`: Estimate total prompt tokens +- `_strip_stale_file_refreshes(history)`: Remove old [FILES 
UPDATED] blocks +- `_trim_anthropic_history(system_blocks, history)`: Trim Anthropic history at 180k limit +- `_ensure_anthropic_client()`: Initialize Anthropic client if needed +- `_chunk_text(text, chunk_size)`: Split text into chunks +- `_build_chunked_context_blocks(md_content)`: Build chunked context for Anthropic +- `_strip_cache_controls(history)`: Remove cache control headers +- `_add_history_cache_breakpoint(history)`: Add cache breakpoint +- `_repair_anthropic_history(history)`: Repair malformed Anthropic history +- `_send_deepseek(md_content, user_message, base_dir, file_items, discussion_history, pre_tool_callback, qa_callback, enable_tools, stream_callback)`: Send via DeepSeek HTTP +- `run_tier4_analysis(stderr)`: Call Tier 4 QA agent with error +- `get_token_stats(md_content)`: Return token usage statistics +- `send(md_content, user_message, base_dir, file_items, discussion_history, stream, pre_tool_callback, qa_callback, enable_tools, stream_callback)`: Main dispatcher +- `_add_bleed_derived(d, sys_tok, tool_tok)`: Add derived stats to comms entry +- `get_history_bleed_stats(md_content)`: Return history bleed statistics + +**Usage Pattern:** Unified interface for all LLM providers, handles tool calling, streaming, history management, and caching. + +#### 7. 
gemini_cli_adapter.py (4.6KB) +**Purpose:** Subprocess bridge for Gemini CLI + +**Class: `GeminiCliAdapter`** +- `__init__(binary_path)`: Initialize with path to gemini binary +- `send(message, safety_settings, system_instruction, model, stream_callback)`: + - Send message to CLI via stdin + - Parse streaming JSON output line-by-line + - Handle tool_use events (continue reading until final result) + - Extract usage metadata from result event + - Call stream_callback for each message chunk +- `count_tokens(contents)`: Estimate tokens (4 chars/token) +- `__enter__`, `__exit__`: Context manager interface (not implemented) + +**Usage Pattern:** Bridges GUI to Gemini CLI subprocess for headless provider support. +### MCP Tools Bridge (1 module, ~48KB) + +#### 8. mcp_client.py (48.2KB) +**Purpose:** MCP-style file context tools with security restrictions + +**Module-Level State:** +- `_allowed_paths`: Set of resolved absolute Path objects (files or dirs) +- `_base_dirs`: Set of resolved absolute Path dirs that act as roots +- `_primary_base_dir`: Path to primary base dir +- `MUTATING_TOOLS`: frozenset of mutating tool names +- `perf_monitor_callback`: Optional callback for performance metrics +- `MCP_TOOL_SPECS`: List of 26 tool specification dicts + +**Module-Level Functions:** +- `configure(file_items, extra_base_dirs)`: Build allowlist from file_items +- `_is_allowed(path)`: Check if path is within allowlist +- `_resolve_and_check(raw_path)`: Resolve path and verify it passes allowlist +- `read_file(path)`: Read file content +- `list_directory(path)`: List directory entries +- `search_files(path, pattern)`: Search files by glob pattern +- `get_file_summary(path)`: Return heuristic summary via summarize.py +- `py_get_skeleton(path)`: Get AST skeleton of Python file +- `py_get_code_outline(path)`: Get code outline with line ranges +- `get_file_slice(path, start_line, end_line)`: Read specific line range +- `set_file_slice(path, start_line, end_line, new_content)`: Replace 
line range +- `edit_file(path, old_string, new_string, replace_all)`: String replacement +- `_get_symbol_node(tree, name)`: Find AST node by name +- `py_get_definition(path, name)`: Get full source of class/function +- `py_update_definition(path, name, new_content)`: Update definition via AST +- `py_get_signature(path, name)`: Get function/method signature +- `py_set_signature(path, name, new_signature)`: Replace signature +- `py_get_class_summary(path, name)`: Get class method signatures +- `py_get_var_declaration(path, name)`: Get variable declaration +- `py_set_var_declaration(path, name, new_declaration)`: Replace variable declaration +- `get_git_diff(path, base_rev, head_rev)`: Get git diff +- `py_find_usages(path, name)`: Find symbol usages +- `py_get_imports(path)`: Get file imports +- `py_check_syntax(path)`: Check Python syntax +- `py_get_hierarchy(path, class_name)`: Find subclasses +- `py_get_docstring(path, name)`: Get docstring +- `get_tree(path, max_depth)`: Get directory tree +- `web_search(query)`: Search via DuckDuckGo +- `fetch_url(url)`: Fetch URL content +- `get_ui_performance()`: Get UI performance metrics +- `dispatch(tool_name, tool_input)`: Dispatch tool call to implementation + +**Read-Only Tools:** +- run_powershell (via shell_runner.py) +- read_file, list_directory, search_files, get_file_summary +- py_get_skeleton, py_get_code_outline, get_file_slice +- py_get_definition, py_get_signature, py_get_class_summary +- py_get_var_declaration, get_git_diff, py_find_usages +- py_get_imports, py_check_syntax, py_get_hierarchy, py_get_docstring +- get_tree, web_search, fetch_url, get_ui_performance + +**Mutating Tools:** +- set_file_slice, py_update_definition, py_set_signature, py_set_var_declaration, edit_file + +**Security:** +- All paths resolved to absolute paths +- Allowlist built from file_items + base_dirs +- Blacklist: history files never allowed +- MUTATING_TOOLS tracked for HITL enforcement + +**Usage Pattern:** AI calls these tools via 
function dispatch; security enforced at dispatch layer. + +### MMA Orchestration (4 modules, ~26KB) + +#### 9. multi_agent_conductor.py (13.7KB) +**Purpose:** 4-tier orchestration engine + +**Class: `ConductorEngine`** +- `__init__(track, event_queue, auto_queue)`: Initialize engine +- `_push_state(status, active_tier)`: Push MMA state update to GUI +- `parse_json_tickets(json_str)`: Parse JSON tickets into Ticket objects +- `run(md_content)`: Main execution loop + - While True: + - ready_tasks = self.engine.tick() + - If no ready_tasks: check if all done or blocked + - If any in_progress: await asyncio.sleep(1) # Waiting for async workers + - else: await self._push_state("blocked") # No executable tasks + - For each ready task: + - If in_progress or (auto_queue and not step_mode): + - Mark in_progress, spawn worker + - Else if todo: wait for HITL approval + +**Module Functions:** +- `_queue_put(event_queue, loop, event_name, payload)`: Thread-safe queue push +- `confirm_execution(payload, event_queue, ticket_id, loop)`: Push HITL approval for execution +- `confirm_spawn(role, prompt, context_md, event_queue, ticket_id, loop)`: Push HITL approval for spawn +- `run_worker_lifecycle(ticket, context, context_files, event_queue, engine, md_content, loop)`: Execute single ticket + - Reset ai_client session (Context Amnesia) + - Build context with AST skeletons + - Call ai_client.send() with tools + - Handle blocking, completion, errors + - Update ticket status + - Push state updates + +**Usage Pattern:** Orchestrates track execution through DAG-driven worker lifecycle. + +#### 10. 
conductor_tech_lead.py (2.8KB) +**Purpose:** Tier 2 ticket generation + +**Functions:** +- `generate_tickets(track_brief, module_skeletons)`: + - Set tier2_sprint_planning system prompt + - Call ai_client.send() with track brief + skeletons + - Extract JSON tickets from response (defensive parsing) + - Return list of ticket dicts +- `topological_sort(tickets)`: + - Convert dicts to Ticket objects + - Build TrackDAG + - Call dag.topological_sort() + - Return ordered ticket dicts + - Raise ValueError on cycle or missing dependency + +**Usage Pattern:** Converts track brief into executable ticket DAG. + +#### 11. orchestrator_pm.py (4.2KB) +**Purpose:** Tier 1 strategic planning + +**Constants:** +- `CONDUCTOR_PATH`: Path("conductor") + +**Functions:** +- `get_track_history_summary()`: Scan conductor/archive/ and conductor/tracks/ + - Read metadata.json from all tracks + - Build summary markdown string +- `generate_tracks(user_request, project_config, file_items, history_summary)`: + - Set tier1_epic_init system prompt + - Call ai_client.send() with user request + context + - Extract JSON tracks from response + - Return list of track dicts + +**Usage Pattern:** Breaks epic into implementation tracks. + +#### 12. mma_prompts.py (5.4KB) +**Purpose:** System prompt templates for hierarchical orchestration + +**Prompt Dictionary:** +- `TIER1_BASE_SYSTEM`: Tier 1 base system prompt +- `TIER1_EPIC_INIT`: Epic initialization prompt +- `TIER1_TRACK_DELEGATION`: Track delegation prompt +- `TIER1_MACRO_MERGE`: Macro-merge and acceptance prompt +- `TIER2_BASE_SYSTEM`: Tier 2 base system prompt +- `TIER2_SPRINT_PLANNING`: Sprint planning prompt +- `TIER2_CODE_REVIEW`: Code review prompt +- `TIER2_TRACK_FINALIZATION`: Track finalization prompt +- `TIER2_CONTRACT_FIRST`: Contract-first delegation prompt + +**Usage Pattern:** Provides structured, constraint-focused prompts for each tier. + +#### 13. 
dag_engine.py (5.4KB) +**Purpose:** Dependency graph and execution engine + +**Class: `TrackDAG`** +- `__init__(tickets)`: Initialize DAG with ticket list +- `ticket_map`: O(1) lookup by ID +- `cascade_blocks()`: Transitively mark todo tickets as blocked if any dependency is blocked +- `get_ready_tasks()`: Return tickets where status==todo and all deps are completed +- `has_cycle()`: DFS cycle detection (returns bool) + +**Class: `ExecutionEngine`** +- `__init__(dag, auto_queue)`: Initialize engine +- `tick()`: Evaluate DAG and return ready tasks + - If auto_queue: auto-promote non-step_mode tasks to in_progress +- `approve_task(task_id)`: Manually transition todo to in_progress +- `update_task_status(task_id, status)`: Force-update status + +**Usage Pattern:** Manages ticket execution with dependency resolution and auto-queue vs step_mode. +### Project & Context Management (5 modules, ~40KB) + +#### 14. project_manager.py (13.2KB) +**Purpose:** TOML config and history persistence + +**Constants:** +- `TS_FMT`: Timestamp format string +- `CONFIG_PATH`: Path to global config + +**Functions:** +- `now_ts()`: Return formatted timestamp +- `parse_ts(s)`: Parse timestamp string to datetime +- `entry_to_str(entry)`: Serialize dict to TOML string +- `str_to_entry(raw, roles)`: Parse TOML string to dict +- `get_git_commit(git_dir)`: Get current commit hash +- `get_git_log(git_dir, n)`: Get last n git log entries +- `default_discussion()`: Return empty discussion dict +- `default_project(name)`: Return default project dict +- `get_history_path(project_path)`: Return path to history TOML +- `load_project(path)`: Load project TOML (with legacy migration) +- `load_history(project_path)`: Load segregated history file +- `clean_nones(data)`: Recursively remove None values +- `save_project(proj, path, disc_data)`: Save project (segregates discussion) +- `migrate_from_legacy_config(cfg)`: Migrate legacy config to new format +- `flat_config(proj, disc_name, track_id)`: Return flat 
config dict +- `save_track_state(track_id, state, base_dir)`: Save TrackState to TOML +- `load_track_state(track_id, base_dir)`: Load TrackState from TOML +- `load_track_history(track_id, base_dir)`: Load track history from state +- `save_track_history(track_id, history, base_dir)`: Save track history to state +- `get_all_tracks(base_dir)`: Scan tracks directory and return track metadata list + +**Usage Pattern:** Manages all project and track persistence operations. + +#### 15. aggregate.py (14.2KB) +**Purpose:** Context construction for AI + +**Functions:** +- `find_next_increment(output_dir, namespace)`: Find next filename increment +- `is_absolute_with_drive(entry)`: Check if path is absolute with drive letter +- `resolve_paths(base_dir, entry)`: Resolve glob patterns to absolute paths +- `build_discussion_section(history)`: Build discussion history markdown +- `build_files_section(base_dir, files)`: Build files section markdown +- `build_screenshots_section(base_dir, screenshots)`: Build screenshots section markdown +- `build_file_items(base_dir, files)`: Read all files and build item dicts +- `build_summary_section(base_dir, files)`: Build compact summary via summarize.py +- `_build_files_section_from_items(file_items)`: Build files markdown from pre-read items +- `build_markdown_from_items(file_items, screenshot_base_dir, screenshots, history, summary_only)`: Build full markdown +- `build_markdown_no_history(file_items, screenshot_base_dir, screenshots, summary_only)`: Build markdown without history +- `build_discussion_text(history)`: Build discussion text only +- `build_tier1_context(file_items, screenshot_base_dir, screenshots, history)`: Tier 1 context (strategic, full conductor files) +- `build_tier2_context(file_items, screenshot_base_dir, screenshots, history)`: Tier 2 context (architectural, full files) +- `build_tier3_context(file_items, screenshot_base_dir, screenshots, history, focus_files)`: Tier 3 context (execution, focus files full, others 
skeletal) +- `build_markdown(base_dir, files, screenshot_base_dir, screenshots, history, summary_only)`: Main context builder +- `run(config)`: Main entry point - loads project and builds markdown + +**Usage Pattern:** Constructs tier-specific context for different MMA tiers. + +#### 16. file_cache.py (5.2KB) +**Purpose:** AST parsing for code views + +**Class: `ASTParser`** +- `__init__(language)`: Initialize parser with tree-sitter +- `parse(code)`: Parse code to tree-sitter Tree +- `get_skeleton(code)`: + - Parse code to AST + - Walk function_definition nodes + - Replace function bodies with "..." (preserve docstrings) +- `get_curated_view(code)`: + - Same as get_skeleton + - BUT: preserve bodies with @core_logic decorator or # [HOT] comments + +**Usage Pattern:** Generate compact code views for worker context injection. + +#### 17. summarize.py (6.5KB) +**Purpose:** Heuristic file summarizer (no AI calls) + +**Summarizer Dictionary:** +- `_SUMMARISERS`: dict mapping extensions to summarizer functions + +**Summarizer Functions:** +- `_summarise_python(path, content)`: + - Parse with ast + - Extract imports (deduplicated module names) + - Extract ALL_CAPS constants + - Extract classes with method names + - Extract top-level function names + - Output: "**Python — N lines**\nimports: ...\nconstants: ...\nclass ClassName: method1, method2\nfunctions: ..." 
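As a rough illustration of the `_summarise_python` heuristic above, here is a minimal self-contained sketch (the function name and exact output format approximate what the report describes; the real summarize.py may differ in detail):

```python
import ast

def summarise_python(content: str) -> str:
    """Minimal sketch of the heuristic summariser: extract imports,
    ALL_CAPS constants, classes with method names, and top-level
    functions -- purely structural, no AI calls."""
    tree = ast.parse(content)
    imports, constants, classes, functions = [], [], [], []
    for node in tree.body:
        if isinstance(node, ast.Import):
            # Keep only the top-level module name (e.g. "os" from "os.path")
            imports += [a.name.split(".")[0] for a in node.names]
        elif isinstance(node, ast.ImportFrom):
            imports.append((node.module or "").split(".")[0])
        elif isinstance(node, ast.Assign):
            for t in node.targets:
                if isinstance(t, ast.Name) and t.id.isupper():
                    constants.append(t.id)
        elif isinstance(node, ast.ClassDef):
            methods = [n.name for n in node.body
                       if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
            classes.append(f"class {node.name}: {', '.join(methods)}")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(node.name)
    lines = [f"**Python — {len(content.splitlines())} lines**"]
    if imports:
        # dict.fromkeys deduplicates while preserving first-seen order
        lines.append("imports: " + ", ".join(dict.fromkeys(imports)))
    if constants:
        lines.append("constants: " + ", ".join(constants))
    lines += classes
    if functions:
        lines.append("functions: " + ", ".join(functions))
    return "\n".join(lines)
```

Because it walks only `tree.body`, nested helpers and method bodies never appear, which is what keeps the summary token-efficient.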
+ +- `_summarise_toml(path, content)`: + - Parse TOML + - Extract top-level table keys + - Count array lengths + - Output table keys and sizes + +- `_summarise_markdown(path, content)`: + - Extract h1-h3 headings via regex + - Output heading hierarchy + +- `_summarise_generic(path, content)`: + - Count lines + - Output first 8 lines as preview + +**Module Functions:** +- `summarise_file(path, content)`: Dispatch to appropriate summarizer +- `summarise_items(file_items)`: Summarize all file items +- `build_summary_markdown(file_items)`: Build markdown summary string + +**Usage Pattern:** Generate token-efficient structural summaries. + +#### 18. outline_tool.py (1.9KB) +**Purpose:** Code structure mapper + +**Class: `CodeOutliner`** +- `__init__()`: Initialize +- `outline(code)`: + - Parse with ast + - Walk top-level nodes + - Extract ClassDef → "[Class] Name (Lines X-Y)" + - Extract FunctionDef → "[Func] Name (Lines X-Y)" + - Extract AsyncFunctionDef → "[Async Func] Name (Lines X-Y)" + - Extract method signatures from ClassDef bodies + - Output hierarchical outline + +**Usage Pattern:** Generate detailed code outlines. +### Logging & History (3 modules, ~17KB) + +#### 19. 
session_logger.py (5.9KB) +**Purpose:** Timestamped audit trails + +**Module-Level State:** +- `_LOG_DIR`: Path("./logs/sessions") +- `_SCRIPTS_DIR`: Path("./scripts/generated") +- `_ts`: Session timestamp string (YYYYMMDD_HHMMSS) +- `_session_id`: Session ID string +- `_session_dir`: Path to session subdirectory +- `_seq`: Monotonic counter for script files +- `_seq_lock`: threading.Lock for counter +- `_comms_fh`: File handle for comms.log +- `_tool_fh`: File handle for toolcalls.log +- `_api_fh`: File handle for apihooks.log +- `_cli_fh`: File handle for clicalls.log + +**Functions:** +- `_now_ts()`: Return formatted timestamp +- `open_session(label)`: + - Create session subdirectory + - Open log files (comms, toolcalls, apihooks, clicalls) +- `close_session()`: Flush and close all log files +- `log_api_hook(method, path, payload)`: Write to apihooks.log as JSON-L +- `log_comms(entry)`: Write to comms.log as JSON-L (thread-safe) +- `log_tool_call(script, result, script_path)`: + - Write to toolcalls.log + - Write PS1 script to scripts/generated/ + - Return script path +- `log_cli_call(command, stdin_content, stdout_content, stderr_content, latency)`: Write to clicalls.log as JSON-L + +**Usage Pattern:** All AI interactions logged for audit trails. + +#### 20. 
log_registry.py (9KB) +**Purpose:** Session metadata persistence + +**Class: `LogRegistry`** +- `__init__(registry_path)`: Initialize with registry path +- `_registry_data`: Dict for registry contents +- `_registry_path`: Path to registry TOML file + +**Methods:** +- `load_registry()`: Load TOML registry from disk +- `save_registry()`: Save registry to TOML disk +- `register_session(session_id, path, start_time)`: Add session to registry +- `update_session_metadata(session_id, message_count, errors, size_kb, whitelisted, reason)`: Update session metadata +- `is_session_whitelisted(session_id)`: Check whitelist status +- `update_auto_whitelist_status(session_id)`: + - Analyze session logs + - Auto-whitelist if errors present, high message count, or large size +- `get_old_non_whitelisted_sessions(cutoff_datetime)`: Return old non-whitelisted sessions + +**Usage Pattern:** Tracks all sessions with metadata for pruning. + +#### 21. log_pruner.py (2.2KB) +**Purpose:** Automated log cleanup + +**Class: `LogPruner`** +- `__init__(log_registry, logs_dir)`: Initialize with registry and logs dir + +**Methods:** +- `prune()`: + - Get old non-whitelisted sessions from registry + - For each session: check total size < 2KB (2048 bytes) + - If small and not whitelisted: delete session directory + +**Usage Pattern:** Automatically deletes insignificant old logs. + +### GUI Layer (2 modules, ~148KB) + +#### 22. 
gui_2.py (77.6KB) +**Purpose:** ImGui/Dear PyGui interface + +**Class: `App`** +- `__init__()`: Initialize application state + - Load config and project + - Start asyncio event loop thread + - Initialize performance monitor + - Initialize theme + - Setup GUI panels + - Register action handlers + - Load fonts +- `shutdown()`: Cleanly shutdown all services + +**Key Internal Methods:** +- `_handle_approve_tool(user_data)`: UI wrapper for tool approval +- `_handle_approve_mma_step(user_data)`: UI wrapper for MMA step approval +- `_handle_approve_spawn(user_data)`: UI wrapper for spawn approval +- `_handle_generate_send(user_data)`: Handle Generate + Send button click +- `_handle_reset_session(user_data)`: Handle Reset Session button click +- `_test_callback_func_write_to_file(data)`: Dummy test callback +- `_load_active_project()`: Load active project from config +- `_prune_old_logs()`: Async prune old logs on startup +- `_init_ai_and_hooks()`: Wire AI callbacks to GUI handlers +- `_init_actions()`: Build action map for _process_pending_gui_tasks +- `_process_pending_gui_tasks()`: + - Drain task lists under locks + - Execute actions (set_value, click, etc.)
+ + - Handle approval dialogs + - Handle AI responses + - Handle MMA state updates + - Handle pending history adds +- `_render_text_viewer(label, content)`: Render text viewer panel +- `_render_heavy_text(label, content)`: Render large text with clamping +- `_show_menus()`: Render menu bar +- `_gui_func()`: Main ImGui render function +- `_render_projects_panel()`: Render projects hub +- `_render_context_panel()`: Render context hub +- `_render_ai_settings_panel()`: Render AI settings hub +- `_render_discussion_panel()`: Render discussion hub +- `_render_operations_panel()`: Render operations hub +- `_render_provider_panel()`: Render provider selection +- `_render_token_budget_panel()`: Render token budget panel +- `_render_message_panel()`: Render message input +- `_render_response_panel()`: Render AI response +- `_render_tool_calls_panel()`: Render tool call history +- `_render_comms_history_panel()`: Render comms log viewer +- `_render_mma_dashboard()`: Render MMA dashboard + - Track browser + - Ticket DAG visualization + - Tier stream panels + - Ticket list with actions +- `_render_tier_stream_panel(tier_key, stream_key | None)`: Render tier stream panels +- `_render_ticket_dag_node(...)`: Render DAG node +- `_render_system_prompts_panel()`: Render system prompt editor +- `_render_theme_panel()`: Render theme selection +- `_load_fonts()`: Load custom fonts +- `_post_init()`: Post-initialization setup +- `run()`: Initialize ImGui runner and start main loop + +**Properties:** +- `current_provider` (get/set): Current AI provider +- `current_model` (get/set): Current model name + +**GUI State Variables:** +- `_project`, `_projects`: Project data +- `_files_base_dir`, `_screenshots_base_dir`: File paths +- `_ai_response`, `_ai_status`: AI response text and status +- `_comms_log`: Comms log entries +- `_tool_log`: Tool call log entries +- `_provider_options`, `_model_options`: Available options +- 
`_mma_status`, `_active_tier`, `_mma_streams`: MMA state +- `_active_track`, `_active_tickets`: Track and ticket data +- `_pending_gui_tasks`, `_pending_comms`, `_pending_tool_calls`, `_pending_history_adds`: Task queues +- `_pending_dialog`: Current approval dialog +- `_pending_mma_approval`: MMA step approval +- `_pending_mma_spawn`: MMA spawn approval +- `_pending_ask_dialog`: Ask tool dialog +- `_show_track_proposal`: Track proposal modal flag +- `_proposed_tracks`: List of proposed tracks + +**Threading Primitives:** +- `_loop`: asyncio event loop +- `_loop_thread`: Daemon thread for event loop +- `_pending_gui_tasks_lock`: Lock for GUI task list +- `_pending_comms_lock`: Lock for comms list +- `_pending_tool_calls_lock`: Lock for tool calls list +- `_pending_history_adds_lock`: Lock for history adds +- `_pending_dialog_lock`: Lock for dialog state +- `_send_thread_lock`: Lock for send_thread creation +- `_pending_actions`: Dict for pending actions + +**Usage Pattern:** Main GUI orchestrator with ImGui, event handling, MMA integration. + +### Theming (2 modules, ~27KB) + +#### 23. theme.py (15KB) +**Purpose:** Dear PyGui theming (legacy) + +**Palettes:** +- `_PALETTES`: Dict of palette name to color dict + - "DPG Default", "10x Dark", "Nord Dark", "Monokai" + - Each palette maps semantic names to RGB tuples + - WindowBg, ChildBg, PopupBg, Border, FrameBg, etc.
+ +**Color Mapping:** +- `_COL_MAP`: Maps semantic names to DPG mvThemeCol_* constants + +**State:** +- `_current_theme_tag`: Current theme tag +- `_current_font_tag`: Current font tag +- `_font_registry_tag`: Font registry tag +- `_current_palette`: Current palette name +- `_current_font_path`: Current font file path +- `_current_font_size`: Current font size +- `_current_scale`: Current scale factor + +**Functions:** +- `get_palette_names()`: Return palette names +- `get_current_palette()`: Return current palette +- `get_current_font_path()`: Return font path +- `get_current_font_size()`: Return font size +- `get_palette_colours(name)`: Return color dict for palette +- `apply(palette_name, overrides)`: Apply theme (with optional overrides) +- `apply_font(font_path, size)`: Load TTF font +- `set_scale(factor)`: Set UI scale +- `save_to_config(config)`: Save theme to config +- `load_from_config(config)`: Load theme from config + +**Usage Pattern:** Deprecated Dear PyGui theming system. + +#### 24. 
theme_2.py (12.4KB) +**Purpose:** ImGui-bundle theming (current) + +**Palettes:** +- `_PALETTES`: Dict of palette name to color dict + - "ImGui Dark", "10x Dark", "Nord Dark", "Monokai" + - Each palette maps imgui.Col_ enum values to RGBA tuples + +**State:** +- `_current_palette`: Current palette name +- `_current_font_path`: Current font file path +- `_current_font_size`: Current font size +- `_current_scale`: Current scale factor +- `_custom_font`: Loaded font object + +**Functions:** +- `get_palette_names()`: Return palette names +- `get_current_palette()`: Return current palette +- `get_current_font_path()`: Return font path +- `get_current_font_size()`: Return font size +- `get_current_scale()`: Return scale +- `_c(r, g, b, a)`: Helper to convert RGB to RGBA [0-1] +- `apply(palette_name)`: Apply palette colors to imgui style +- `set_scale(factor)`: Set font scale +- `save_to_config(config)`: Save theme to config +- `load_from_config(config)`: Load theme and scale from config +- `apply_current()`: Apply loaded palette and scale +- `get_font_loading_params()`: Return (font_path, size) for font loading + +**Usage Pattern:** Current ImGui-bundle theming system. +### Utilities (1 module, ~1KB) + +#### 25. cost_tracker.py (1.2KB) +**Purpose:** Token cost estimation + +**Constants:** +- `MODEL_PRICING`: List of (pattern, pricing_dict) tuples + - gemini-2.5-flash-lite: $0.075/$0.30 per 1M tokens + - gemini-2.5-flash: $0.15/$0.60 per 1M tokens + - gemini-3-flash-preview: $0.15/$0.60 per 1M tokens + - gemini-3.1-pro-preview: $3.50/$10.50 per 1M tokens + - claude-sonnet: $3.00/$15.00 per 1M tokens + - claude-opus: $15.00/$75.00 per 1M tokens + - deepseek-v3: $0.27/$1.10 per 1M tokens + +**Functions:** +- `estimate_cost(model, input_tokens, output_tokens)`: Calculate total cost in USD + +**Usage Pattern:** Estimate token costs for budget tracking. 
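The pricing lookup above can be sketched as a first-match scan over `(pattern, pricing)` pairs. This is a hedged sketch, not the real implementation: the substring matching rule and the `input`/`output` dict keys are assumptions; only the per-1M-token rates are taken from the list above (abridged here).

```python
# Illustrative sketch of cost_tracker.estimate_cost. The (pattern, pricing)
# shape mirrors MODEL_PRICING above; matching by substring is an assumption.
MODEL_PRICING = [
    # More specific patterns must come first so "-lite" wins over the base model.
    ("gemini-2.5-flash-lite", {"input": 0.075, "output": 0.30}),
    ("gemini-2.5-flash",      {"input": 0.15,  "output": 0.60}),
    ("claude-sonnet",         {"input": 3.00,  "output": 15.00}),
]

def estimate_cost(model, input_tokens, output_tokens):
    """Return estimated USD cost for a call, or 0.0 for unknown models."""
    for pattern, pricing in MODEL_PRICING:  # first match wins
        if pattern in model:
            return (input_tokens / 1_000_000 * pricing["input"]
                    + output_tokens / 1_000_000 * pricing["output"])
    return 0.0
```

With these rates, one million input plus one million output tokens on a flash-lite model costs $0.075 + $0.30 = $0.375; unknown model names fall through to zero rather than raising.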
+## Part 8: Deep Testing & Simulation Architecture Analysis + +### Complete File Inventory (100+ test files, ~50KB total) + +#### Fixtures & Infrastructure (4 files) + +##### conftest.py (10.7KB) +**Purpose:** Central test configuration and shared fixtures + +**Classes:** +- `VerificationLogger`: Structured diagnostic logging + - `__init__(test_name, script_name)`: Initialize with test name and logs dir + - `log_state(field, before, after, delta)`: Log state change + - `finalize(title, status, result_msg)`: Finalize test log + +**Functions:** +- `kill_process_tree(pid)`: Robustly kill process and all children + - Windows: `taskkill /F /T /PID ` + - Unix: `os.killpg(os.getpgid(pid), SIGKILL)` + +**Fixtures:** +- `vlogger(request)`: Provide VerificationLogger instance +- `mock_app()`: Mock version of App for simple unit tests +- `app_instance()`: Centralized App instance with all external side effects mocked +- `live_gui(scope="session")`: + - Spawn sloppy.py with --enable-test-hooks + - Use CREATE_NEW_PROCESS_GROUP on Windows + - Redirect stdout/stderr to logs/gui_2_py_test.log + - Poll GET /status up to 15 seconds + - Check process.poll() each iteration to detect early crashes + - If hook server never responds: kill process, pytest.fail() + - In finally: reset_session(), kill_process_tree(), close log_file + - Yield (process, gui_script) + +**Issues:** +- No cleanup verification in reset_ai_client fixture (no teardown) +- Timeout-based readiness check may miss transient initialization states + +##### mock_alias_tool.py (1.1KB) +**Purpose:** Simulates tool calling with alias resolution + +**Behavior:** +- Reads prompt from stdin +- If `'"role": "tool"'` in prompt: Returns mock success +- Else: Calls bridge script (cli_tool_bridge.py) for alias resolution +- Outputs two JSON-L lines: message + result + +##### mock_context_bleed.py (382 bytes) +**Purpose:** Mock agent for bleed testing + +**Behavior:** +- Prints init event with session_id +- Prints user message +- 
Prints assistant response +- Prints result with stats + +##### mock_gemini_cli.py (7.4KB) +**Purpose:** Fake Gemini CLI executable for integration tests + +**Main Function:** +- `main()`: + - Check sys.argv for management commands ("mcp", "extensions", "skills", "hooks") + - If management command: return silently + - Read prompt from stdin + - Route to response based on keyword matching + +**Response Routing:** +- `'"role": "tool"'` OR `"tool_call_id"` in prompt: + - Return "Tool worked!" success message +- `'"PATH: Epic Initialization"'` in prompt: + - Return two mock Track objects with IDs "mock-track-1", "mock-track-2" + - Session ID: "mock-session-epic" +- `'"PATH: Sprint Planning"'` in prompt: + - Return two mock Ticket objects + - "mock-ticket-1": independent + - "mock-ticket-2": depends on "mock-ticket-1" + - Session ID: "mock-session-sprint" +- Default (Tier 3 worker prompts): + - Return "SUCCESS: Mock Tier 3 worker implemented change. [MOCK OUTPUT]" + - Session ID: "mock-session-default" + +**Output Format:** +- Every response is exactly two JSON-L lines +- Line 1: `{"type": "message", "role": "assistant", "content": "..."}` +- Line 2: `{"type": "result", "status": "success", "stats": {"total_tokens": N}, "session_id": "mock-session-*"}` + +**Debug Output:** +- All debug info goes to stderr, keeping stdout clean for JSON-L protocol + +**Critical Flaws:** +1. **No input validation**: Never checks if prompt contains garbage +2. **No logic verification**: Never verifies Tier 2 actually generated valid tickets, only that it called prompt +3. **No failure modes**: Real LLM errors (malformed JSON, timeouts, rate limits) never tested +4. **No tool flow verification**: Doesn't verify tools were called correctly +5. 
**Deterministic responses**: Can't test conversation context, multi-turn flows, or error recovery + +#### Architecture Boundary Tests (3 files) + +##### test_arch_boundary_phase1.py (3.8KB) +**Purpose:** Tests architecture boundary hardening — Phase 1 + +**Test Class:** `TestArchBoundaryPhase1(unittest.TestCase)` + +**Tests:** +- `test_unfettered_modules_constant_removed()`: Check "UNFETTERED_MODULES" string absent from execute_agent source +- `test_full_module_context_never_injected()`: Verify "FULL MODULE CONTEXT" not in captured input for mcp_client +- `test_skeleton_used_for_mcp_client()`: Verify "DEPENDENCY SKELETON" is used for mcp_client +- `test_mma_exec_no_hardcoded_path()`: mma_exec.execute_agent must not contain hardcoded machine paths +- `test_claude_mma_exec_no_hardcoded_path()`: claude_mma_exec.execute_agent must not contain hardcoded machine paths + +**Coverage:** Code structure verification only + +##### test_arch_boundary_phase2.py (6.7KB) +**Purpose:** Tests architecture boundary hardening — Phase 2 + +**Constants:** +- `MUTATING_TOOLS`: {"set_file_slice", "py_update_definition", "py_set_signature", "py_set_var_declaration"} +- `ALL_DISPATCH_TOOLS`: All 26 tool names + +**Functions:** +- `test_toml_exposes_all_dispatch_tools()`: manual_slop.toml [agent.tools] must list every tool +- `test_toml_mutating_tools_disabled_by_default()`: Mutating tools must default to false in manual_slop.toml +- `test_default_project_exposes_all_dispatch_tools()`: default_project() agent.tools must list every tool +- `test_default_project_mutating_tools_disabled()`: Mutating tools must default to False in default_project() +- `test_gui_agent_tool_names_exposes_all_dispatch_tools()`: AGENT_TOOL_NAMES in gui_2.py must include every tool +- `test_mcp_client_has_mutating_tools_constant()`: mcp_client must expose MUTATING_TOOLS frozenset +- `test_mutating_tools_contains_write_tools()`: MUTATING_TOOLS must include all four write tools +- 
`test_mutating_tools_excludes_read_tools()`: MUTATING_TOOLS must not include read-only tools +- `test_mutating_tool_triggers_pre_tool_callback(monkeypatch)`: When mutating tool is called and pre_tool_callback is set, it must be invoked +- `test_mutating_tool_skips_callback_when_rejected(monkeypatch)`: When pre_tool_callback returns None (rejected), dispatch must NOT be called +- `test_non_mutating_tool_skips_callback()`: Read-only tools must NOT trigger pre_tool_callback + +**Coverage:** Tool config exposure and HITL enforcement verification + +##### test_arch_boundary_phase3.py (2.9KB) +**Purpose:** Tests architecture boundary hardening — Phase 3 (not in current set) + +**Tests:** +- `test_cascade_blocks_simple()`: Blocked dependency blocks immediate dependent +- `test_cascade_blocks_multi_hop()`: Blocking cascades through multiple levels +- `test_cascade_blocks_no_cascade_to_completed()`: Completed tasks not changed even if dependency blocked +- `test_cascade_blocks_partial_dependencies()`: Partial dependencies blocked → 
dependent blocked +- `test_cascade_blocks_already_in_progress()`: In-progress tasks not blocked automatically + +**Coverage:** DAG blocking cascade verification + +#### API Hooks & Integration (8 files) + +##### test_api_hook_client.py (3.5KB) +**Purpose:** Test ApiHookClient methods against live GUI + +**Tests:** +- `test_get_status_success(live_gui)`: Check get_status retrieves server health +- `test_get_project_success(live_gui)`: Check project data retrieval +- `test_get_session_success(live_gui)`: Check session data retrieval +- `test_post_gui_success(live_gui)`: Check GUI data posting +- `test_get_performance_success(live_gui)`: Check performance metrics retrieval +- `test_unsupported_method_error()`: Unsupported HTTP method raises ValueError +- `test_get_text_value()`: Test get_text_value wrapper +- `test_get_node_status()`: Test DAG node status retrieval + +**Coverage:** Hook API client verification + +##### test_api_hook_extensions.py (1.9KB) +**Purpose:** Test API hook extensions for UI interaction + +**Tests:** +- `test_api_client_has_extensions()`: Verify ApiHookClient has extension methods +- `test_select_tab_integration(live_gui)`: Test tab selection via hooks +- `test_select_list_item_integration(live_gui)`: Test list item selection +- `test_get_indicator_state_integration(live_gui)`: Test indicator state retrieval +- `test_app_processes_new_actions()`: Test new action processing + +**Coverage:** UI interaction via Hook API + +##### test_conductor_api_hook_integration.py (2.7KB) +**Purpose:** Test Conductor integration via hooks + +**Functions:** +- `simulate_conductor_phase_completion(client)`: Simulates Conductor phase completion using ApiHookClient +- `test_conductor_integrates_api_hook_client_for_verification(live_gui)`: Verify Conductor uses ApiHookClient for verification +- `test_conductor_handles_api_hook_failure(live_gui)`: Verify Conductor handles API hook verification failure +- `test_conductor_handles_api_hook_connection_error()`: Verify 
Conductor handles connection error + +**Coverage:** Conductor-Hook API integration verification + +##### test_headless_service.py (7KB) +**Purpose:** Test headless API service + +**Test Class:** `TestHeadlessAPI(unittest.TestCase)` + +**Tests:** +- `test_health_endpoint()`: Check /status endpoint +- `test_status_endpoint_unauthorized()`: Check /status without API key +- `test_status_endpoint_authorized()`: Check /status with valid API key +- `test_generate_endpoint()`: Check /generate endpoint +- `test_pending_actions_endpoint()`: Check /pending_actions endpoint +- `test_confirm_action_endpoint()`: Check /confirm endpoint +- `test_list_sessions_endpoint()`: Check /sessions endpoint +- `test_get_context_endpoint()`: Check /context endpoint +- `test_endpoint_no_api_key_configured()`: Check behavior without API key + +**Test Class:** `TestHeadlessStartup(unittest.TestCase)` + +**Tests:** +- `test_headless_flag_prevents_gui_run()`: --headless flag prevents GUI, runs FastAPI +- `test_normal_startup_calls_gui_run()`: Normal startup calls GUI + +**Functions:** +- `test_fastapi_installed()`: Verify FastAPI installed +- `test_uvicorn_installed()`: Verify Uvicorn installed + +**Coverage:** FastAPI endpoints and startup verification + +##### test_headless_verification.py (7KB) +**Purpose:** Test headless verification without GUI + +**Tests:** +- `test_headless_verification_full_run(vlogger)`: + - Initialize ConductorEngine with Track + - Simulate full execution run + - Mock ai_client.send for successful tool calls and final responses + - Verify Context Amnesia is maintained +- `test_headless_verification_error_and_qa_interceptor(vlogger)`: + - Simulate shell error + - Verify Tier 4 QA interceptor is triggered + - Verify summary is injected into worker history for next retry + +**Coverage:** Full conductor run verification +#### GUI Testing (26 files) + +##### test_gui2_events.py (1.7KB) +**Purpose:** Test GUI event subscriptions + +**Fixture:** +- `app_instance()`: Create App 
instance with mocked render functions + +**Tests:** +- `test_app_subscribes_to_events(app_instance)`: Verify App.__init__ subscribes necessary event handlers +- `test_handle_ai_response_resets_stream(app_instance)`: Verify handle_ai_response replaces/finalizes stream +- `test_user_request_event_payload()`: Verify UserRequestEvent payload structure +- `test_async_event_queue()`: Verify AsyncEventQueue put/get + +**Coverage:** Event system integration + +##### test_gui2_layout.py (1KB) +**Purpose:** Test GUI hub layout + +**Tests:** +- `test_gui2_hubs_exist_in_show_windows(app_instance)`: Verify new Hub windows in show_windows +- `test_gui2_old_windows_removed_from_show_windows(app_instance)`: Verify old windows removed + +**Coverage:** Layout structure verification + +##### test_gui2_mcp.py (2KB) +**Purpose:** Test MCP tool dispatch + +**Tests:** +- `test_mcp_tool_call_is_dispatched(app_instance)`: Verify tool calls dispatched to mcp_client + +**Coverage:** MCP integration verification + +##### test_gui2_parity.py (3KB) +**Purpose:** Test GUI hooks for value setting and clicking + +**Tests:** +- `test_gui2_set_value_hook_works(live_gui)`: Test set_value GUI hook +- `test_gui2_click_hook_works(live_gui)`: Test click hook for Reset button +- `test_gui2_custom_callback_hook_works(live_gui)`: Test custom_callback hook + +**Coverage:** Hook parity verification + +##### test_gui2_performance.py (2.3KB) +**Purpose:** Test performance benchmarking + +**Tests:** +- `test_performance_benchmarking(live_gui)`: Collects performance metrics for current GUI script +- `test_performance_baseline_check()`: Verifies performance metrics exist + +**Coverage:** Performance tracking verification + +##### test_gui_async_events.py (2.5KB) +**Purpose:** Test async event routing + +**Tests:** +- `test_handle_generate_send_pushes_event(mock_gui)`: Verify handle_generate_send pushes UserRequestEvent +- `test_user_request_event_payload()`: Verify UserRequestEvent structure +- 
`test_async_event_queue()`: Verify AsyncEventQueue operation + +**Coverage:** Async event system verification + +##### test_gui_diagnostics.py (861 bytes) +**Purpose:** Test diagnostics panel + +**Tests:** +- `test_diagnostics_panel_initialization(app_instance)`: Verify diagnostics panel initializes +- `test_diagnostics_history_updates(app_instance)`: Verify performance history updates correctly + +**Coverage:** Diagnostics verification + +##### test_gui_events.py (803 bytes) +**Purpose:** Test GUI event handling + +**Tests:** +- `test_gui_updates_on_event(app_instance)`: Verify GUI updates on events + +**Coverage:** Event-driven UI updates + +##### test_gui_phase3.py (3.1KB) +**Purpose:** Test GUI phase 3 features + +**Tests:** +- `test_track_proposal_editing(app_instance)`: Track proposal editing +- `test_conductor_setup_scan(app_instance, tmp_path)`: Conductor directory scan +- `test_create_track(app_instance, tmp_path)`: Track creation + +**Coverage:** Track proposal workflow + +##### test_gui_phase4.py (7.4KB) +**Purpose:** Test GUI phase 4 features + +**Tests:** +- `test_add_ticket_logic(mock_app)`: Add ticket logic +- `test_remove_ticket_logic(mock_app)`: Remove ticket logic +- `test_toggle_ticket_logic(mock_app)`: Toggle ticket logic + +**Coverage:** Ticket management workflow + +##### test_gui_streaming.py (4.1KB) +**Purpose:** Test MMA streaming + +**Tests:** +- `test_mma_stream_event_routing(app_instance)`: Verify "mma_stream" events reach mma_streams +- `test_mma_stream_multiple_workers(app_instance)`: Verify streaming works for multiple workers +- `test_handle_ai_response_resets_stream(app_instance)`: Verify final response replaces stream +- `test_handle_ai_response_streaming(app_instance)`: Verify streaming appends to mma_streams + +**Coverage:** Stream routing verification + +##### test_gui_stress_performance.py (1.8KB) +**Purpose:** Test stress performance + +**Tests:** +- `test_comms_volume_stress_performance(live_gui)`: Inject many session entries 
and verify performance + +**Coverage:** Stress testing + +##### test_gui_updates.py (1.7KB) +**Purpose:** Test GUI update mechanisms + +**Tests:** +- `test_telemetry_data_updates_correctly(app_instance)`: Verify telemetry updates +- `test_performance_history_updates(app_instance)`: Verify performance history +- `test_gui_updates_on_event(app_instance)`: Verify GUI updates on events + +**Coverage:** Update mechanism verification + +##### test_layout_reorganization.py (2.4KB) +**Purpose:** Test new consolidated Hub layout + +**Tests:** +- `test_new_hubs_defined_in_show_windows(mock_app)`: Verify Hub windows defined +- `test_old_windows_removed_from_gui2(app_instance)`: Verify old windows removed + +**Coverage:** Layout consolidation verification + +##### test_live_gui_integration.py (3.1KB) +**Purpose:** Test user request flow integration + +**Tests:** +- `test_user_request_integration_flow(mock_app)`: Verify UserRequestEvent triggers AI and updates UI + +**Coverage:** Integration flow verification + +##### test_live_workflow.py (3KB) +**Purpose:** Test full GUI workflow + +**Tests:** +- `test_full_live_workflow(live_gui)`: Integration test driving GUI through full workflow + +**Coverage:** End-to-end workflow verification + +#### MMA & Conductor (11 files) + +##### test_conductor_engine.py (14.6KB) +**Purpose:** Test ConductorEngine implementation + +**Tests:** +- `test_conductor_engine_initialization()`: Test ConductorEngine initialization +- `test_conductor_engine_run_executes_tickets_in_order(monkeypatch, vlogger)`: Test run() executes tickets in order +- `test_run_worker_lifecycle_calls_ai_client_send(monkeypatch)`: Test worker lifecycle triggers AI client send +- `test_run_worker_lifecycle_context_injection(monkeypatch)`: Test worker lifecycle injects AST views +- `test_run_worker_lifecycle_blocked(mock_ai_client)`: Test worker marks ticket as blocked +- `test_run_worker_lifecycle_step_mode_confirmation(monkeypatch)`: Test step mode confirmation +- 
`test_conductor_engine_dynamic_parsing_and_execution(monkeypatch, vlogger)`: Test dynamic parsing and execution +- `test_run_worker_lifecycle_streams_response_via_queue(monkeypatch)`: Test streaming response via queue +- `test_run_worker_lifecycle_token_usage_from_comms_log(monkeypatch)`: Test token usage from comms log + +**Coverage:** Conductor engine behavior verification + +##### test_conductor_tech_lead.py (3.4KB) +**Purpose:** Test Conductor Tech Lead + +**Test Class:** `TestConductorTechLead(unittest.TestCase)` + +**Tests:** +- `test_generate_tickets_parse_error()`: Test JSON parsing error handling +- `test_generate_tickets_success()`: Test successful ticket generation +- `test_topological_sort_linear()`: Test linear topological sort +- `test_topological_sort_complex()`: Test complex topological sort +- `test_topological_sort_cycle()`: Test cycle detection +- `test_topological_sort_empty()`: Test empty list handling +- `test_topological_sort_missing_dependency()`: Test missing dependency handling + +**Coverage:** Ticket generation and dependency resolution verification + +##### test_dag_engine.py (3.7KB) +**Purpose:** Test TrackDAG and ExecutionEngine + +**Tests:** +- `test_get_ready_tasks_linear()`: Test ready tasks for linear DAG +- `test_get_ready_tasks_branching()`: Test ready tasks for branching DAG +- `test_has_cycle_no_cycle()`: Test cycle detection with no cycles +- `test_has_cycle_direct_cycle()`: Test direct cycle detection +- `test_has_cycle_indirect_cycle()`: Test indirect cycle detection +- `test_has_cycle_complex_no_cycle()`: Test complex DAG with no cycles +- `test_get_ready_tasks_multiple_deps()`: Test multiple dependencies +- `test_topological_sort()`: Test topological sort +- `test_topological_sort_cycle()`: Test topological sort with cycles + +**Coverage:** DAG algorithm verification + +##### test_execution_engine.py (4.1KB) +**Purpose:** Test ExecutionEngine + +**Tests:** +- `test_execution_engine_basic_flow()`: Test basic flow +- 
`test_execution_engine_update_nonexistent_task()`: Test updating non-existent task +- `test_execution_engine_status_persistence()`: Test status persistence +- `test_execution_engine_auto_queue()`: Test auto_queue behavior +- `test_execution_engine_step_mode()`: Test step_mode behavior +- `test_execution_engine_approve_task()`: Test approve_task behavior + +**Coverage:** Execution engine state machine verification + +##### test_mma_models.py (5.9KB) +**Purpose:** Test MMA data models + +**Tests:** +- `test_ticket_instantiation()`: Test Ticket instantiation with required fields +- `test_ticket_with_dependencies()`: Test Ticket with dependencies +- `test_track_instantiation()`: Test Track instantiation +- `test_track_can_handle_empty_tickets()`: Test Track with empty tickets +- `test_worker_context_instantiation()`: Test WorkerContext instantiation +- `test_ticket_mark_blocked()`: Test ticket.mark_blocked() +- `test_ticket_mark_complete()`: Test ticket.mark_complete() +- `test_track_get_executable_tickets()`: Test track.get_executable_tickets() +- `test_track_get_executable_tickets_complex()`: Test get_executable_tickets with complex dependencies + +**Coverage:** Model invariant verification + +##### test_mma_orchestration_gui.py (4.6KB) +**Purpose:** Test MMA GUI state and orchestration + +**Tests:** +- `test_mma_ui_state_initialization(app_instance)`: Verify UI state initialization +- `test_process_pending_gui_tasks_show_track_proposal(app_instance)`: Verify show_track_proposal action +- `test_cb_plan_epic_launches_thread(app_instance)`: Verify plan epic launches thread +- `test_process_pending_gui_tasks_mma_spawn_approval(app_instance)`: Verify spawn approval action +- `test_handle_ai_response_with_stream_id(app_instance)`: Verify routing to mma_streams +- `test_handle_ai_response_fallback(app_instance)`: Verify fallback to ai_response + +**Coverage:** MMA GUI state management + +##### test_mma_prompts.py (1.9KB) +**Purpose:** Test MMA system prompts + +**Tests:** 
+- `test_tier1_epic_init_constraints()`: Verify Tier 1 epic init prompt constraints +- `test_tier1_track_delegation_constraints()`: Verify Tier 1 track delegation prompt constraints +- `test_tier1_macro_merge_constraints()`: Verify Tier 1 macro-merge prompt constraints +- `test_tier2_sprint_planning_constraints()`: Verify Tier 2 sprint planning prompt constraints +- `test_tier2_code_review_constraints()`: Verify Tier 2 code review prompt constraints +- `test_tier2_track_finalization_constraints()`: Verify Tier 2 track finalization prompt constraints +- `test_tier2_contract_first_constraints()`: Verify Tier 2 contract-first constraints + +**Coverage:** System prompt constraint verification + +##### test_mma_ticket_actions.py (1.1KB) +**Purpose:** Test MMA ticket actions + +**Tests:** +- `test_cb_ticket_retry(app_instance)`: Test ticket retry callback +- `test_cb_ticket_skip(app_instance)`: Test ticket skip callback + +**Coverage:** Ticket action verification + +##### test_orchestration_logic.py (5KB) +**Purpose:** Test orchestration logic + +**Tests:** +- `test_generate_tracks(mock_ai_client)`: Test Tier 1 track generation +- `test_generate_tickets(mock_ai_client)`: Test Tier 2 ticket generation +- `test_topological_sort()`: Test topological sort +- `test_track_executable_tickets()`: Test track.get_executable_tickets() +- `test_conductor_engine_run(vlogger)`: Test ConductorEngine.run() +- `test_parse_json_tickets()`: Test ticket JSON parsing +- `test_run_worker_lifecycle_blocked(mock_ai_client)`: Test worker blocked handling + +**Coverage:** Orchestration logic verification + +##### test_orchestrator_pm.py (3KB) +**Purpose:** Test Orchestrator PM + +**Test Class:** `TestOrchestratorPM(unittest.TestCase)` + +**Tests:** +- `test_generate_tracks_success(mock_send, mock_summarize)`: Test successful track generation +- `test_generate_tracks_markdown_wrapped(mock_send, mock_summarize)`: Test markdown wrapping +- `test_generate_tracks_malformed_json(mock_send, 
mock_summarize)`: Test malformed JSON handling
+
+**Coverage:** Orchestrator PM verification
+
+##### test_orchestrator_pm_history.py (2.8KB)
+**Purpose:** Test Orchestrator PM history
+
+**Test Class:** `TestOrchestratorPMHistory(unittest.TestCase)`
+
+**Tests:**
+- `test_get_track_history_summary()`: Test history summary generation
+- `test_get_track_history_summary_missing_files()`: Test missing file handling
+- `test_generate_tracks_with_history(mock_send, mock_summarize, mock_registry)`: Test track generation with history
+
+**Coverage:** History summary verification
+
+#### MCP & Tools (5 files)
+
+##### test_agent_capabilities.py (1.3KB)
+**Purpose:** Test agent model listing
+
+**Tests:**
+- `test_agent_capabilities_listing()`: Test model listing
+
+**Coverage:** Capability verification
+
+##### test_agent_tools_wiring.py (865 bytes)
+**Purpose:** Test agent tools wiring
+
+**Tests:**
+- `test_set_agent_tools()`: Test set_agent_tools() function
+- `test_build_anthropic_tools_conversion()`: Test Anthropic tools conversion
+
+**Coverage:** Tool setup verification
+
+##### test_cli_tool_bridge.py (2.5KB)
+**Purpose:** Test CLI tool bridge
+
+**Test Class:** `TestCliToolBridge(unittest.TestCase)`
+
+**Tests:**
+- `test_allow_decision(mock_request, mock_stdout, mock_stdin, mock_hook)`: Test allow decision flow
+- `test_deny_decision(mock_request, mock_stdout, mock_stdin, mock_hook)`: Test deny decision flow
+- `test_unreachable_hook_server(mock_request, mock_stdout, mock_stdin, mock_hook)`: Test unreachable server handling
+
+**Coverage:** Bridge decision logic verification
+
+##### test_cli_tool_bridge_mapping.py (1.8KB)
+**Purpose:** Test CLI tool bridge mapping
+
+**Test Class:** `TestCliToolBridgeMapping(unittest.TestCase)`
+
+**Tests:**
+- `test_mapping_from_api_format(mock_request, mock_stdout, mock_stdin, mock_hook)`: Verify mapping from API format
+
+**Coverage:** Format mapping verification
+
+##### test_mcp_perf_tool.py (607 bytes)
+**Purpose:** Test MCP
performance tool + +**Tests:** +- `test_mcp_perf_tool_retrieval()`: Test get_ui_performance retrieval + +**Coverage:** Performance tool verification +#### Simulations (8 files) + +##### test_extended_sims.py (2.7KB) +**Purpose:** Extended simulations against live GUI + +**Tests:** +- `test_context_sim_live(live_gui)`: Context simulation +- `test_ai_settings_sim_live(live_gui)`: AI settings simulation +- `test_tools_sim_live(live_gui)`: Tools simulation +- `test_execution_sim_live(live_gui)`: Execution simulation + +**Coverage:** Multi-simulation integration + +##### test_sim_ai_settings.py (1.4KB) +**Purpose:** Test AI settings simulation + +**Tests:** +- `test_ai_settings_simulation_run()`: Test simulation runs correctly + +**Coverage:** AI settings simulation verification + +##### test_sim_base.py (1.2KB) +**Purpose:** Test base simulation class + +**Tests:** +- `test_base_simulation_init()`: Test base simulation initialization +- `test_base_simulation_setup()`: Test base simulation setup + +**Coverage:** Base simulation verification + +##### test_sim_context.py (1.4KB) +**Purpose:** Test context simulation + +**Tests:** +- `test_context_simulation_run()`: Test context simulation runs correctly + +**Coverage:** Context simulation verification + +##### test_sim_execution.py (1.5KB) +**Purpose:** Test execution simulation + +**Tests:** +- `test_execution_simulation_run()`: Test execution simulation runs correctly + +**Coverage:** Execution simulation verification + +##### test_sim_tools.py (1.2KB) +**Purpose:** Test tools simulation + +**Tests:** +- `test_tools_simulation_run()`: Test tools simulation runs correctly + +**Coverage:** Tools simulation verification + +##### test_user_agent.py (633 bytes) +**Purpose:** Test user agent + +**Tests:** +- `test_user_agent_instantiation()`: Test UserSimAgent instantiation +- `test_perform_action_with_delay()`: Test action with delay + +**Coverage:** User agent behavior verification + +##### test_workflow_sim.py (1.5KB) 
+**Purpose:** Test workflow simulator + +**Tests:** +- `test_simulator_instantiation()`: Test simulator instantiation +- `test_setup_new_project()`: Test project setup +- `test_discussion_switching()`: Test discussion switching +- `test_history_truncation()`: Test history truncation + +**Coverage:** Workflow simulation verification + +##### ping_pong.py (1.6KB) +**Purpose:** Simple ping/pong test + +**Main Function:** +- `main()`: Basic agent interaction verification + +**Coverage:** Minimal simulation test + +##### live_walkthrough.py (2.6KB) +**Purpose:** Full walkthrough script + +**Main Function:** +- `main()`: Orchestrates complete GUI workflow via hooks + +**Coverage:** End-to-end workflow verification +#### Approval & HITL (2 files) + +##### test_spawn_interception.py (3.4KB) +**Purpose:** Test spawn approval interception + +**Tests:** +- `test_confirm_spawn_pushed_to_queue()`: Test confirm_spawn pushes to queue +- `test_rejection_handling(mock_confirm, mock_ai_client, app_instance)`: Test rejection handling +- `test_run_worker_lifecycle_rejected(mock_confirm, mock_ai_client, app_instance)`: Test worker lifecycle on rejection + +**Coverage:** Spawn approval flow verification + +##### test_tier4_interceptor.py (8.1KB) +**Purpose:** Test Tier 4 QA interceptor + +**Tests:** +- `test_run_powershell_qa_callback_on_failure(vlogger)`: Test QA callback on shell failure +- `test_run_powershell_qa_callback_on_stderr_only(vlogger)`: Test QA callback on stderr +- `test_run_powershell_no_qa_callback_on_success()`: Test no QA callback on success +- `test_run_powershell_optional_qa_callback()`: Test optional QA callback +- `test_end_to_end_tier4_integration(vlogger)`: Test end-to-end Tier 4 integration +- `test_ai_client_passes_qa_callback()`: Test ai_client passes qa_callback +- `test_gemini_provider_passes_qa_callback_to_run_script()`: Test Gemini passes qa_callback + +**Coverage:** Tier 4 QA flow verification +#### Token Management (2 files) + +##### test_token_usage.py 
(2.4KB) +**Purpose:** Test token usage tracking + +**Tests:** +- `test_token_usage_tracking()`: Test token tracking + +**Coverage:** Token usage verification + +##### test_token_viz.py (5.3KB) +**Purpose:** Test token visualization + +**Tests:** +- `test_add_bleed_derived_aliases()`: Test bleed stats aliases +- `test_add_bleed_derived_headroom()`: Test headroom calculation +- `test_add_bleed_derived_would_trim_false()`: Test trim boundary +- `test_add_bleed_derived_would_trim_true()`: Test just below threshold +- `test_add_bleed_derived_breakdown()`: Test breakdown calculation +- `test_add_bleed_derived_history_clamped_to_zero()`: Test history clamping +- `test_add_bleed_derived_headroom_clamped_to_zero()`: Test headroom clamping +- `test_get_history_bleed_stats_returns_all_keys_unknown_provider()`: Test all keys returned +- `test_app_token_stats_initialized_empty(app_instance)`: Test token stats initialization +- `test_app_last_stable_md_initialized_empty(app_instance)`: Test last stable MD initialization +- `test_app_has_render_token_budget_panel(app_instance)`: Test token budget panel exists +- `test_render_token_budget_panel_empty_stats_no_crash(app_instance)`: Test panel rendering with empty stats +- `test_would_trim_boundary_exact()`: Test exact trim boundary +- `test_would_trim_just_below_threshold()`: Test just below threshold +- `test_would_trim_just_above_threshold()`: Test just above threshold +- `test_gemini_cache_fields_accessible()`: Test Gemini cache field access +- `test_anthropic_history_lock_accessible()`: Test Anthropic history lock access + +**Coverage:** Token budget panel verification +#### Tiered Context (1 file) + +##### test_tiered_context.py (5.1KB) +**Purpose:** Test tiered context building + +**Tests:** +- `test_build_tier1_context_exists()`: Test build_tier1_context function exists +- `test_build_tier2_context_exists()`: Test build_tier2_context function exists +- `test_build_tier3_context_ast_skeleton(monkeypatch)`: Test 
build_tier3_context uses AST skeletons
+- `test_build_tier3_context_exists()`: Test build_tier3_context function exists
+- `test_build_file_items_with_tiers(tmp_path)`: Test file items with tiers
+- `test_build_files_section_with_dicts(tmp_path)`: Test files section building
+- `test_tiered_context_by_tier_field()`: Test tiered context by tier field
+
+**Coverage:** Tiered context verification
+
+#### Logging & History
+
+##### test_history_management.py (8.7KB)
+**Purpose:** Test history management
+
+**Tests:**
+- `test_aggregate_includes_segregated_history(tmp_path)`: Test aggregate includes segregated history
+- `test_mcp_blacklist(tmp_path)`: Test MCP client blacklists files
+- `test_aggregate_blacklist(tmp_path)`: Test aggregate respects blacklisting
+- `test_migration_on_load(tmp_path)`: Test migration on load
+- `test_save_separation(tmp_path)`: Test save separation
+- `test_history_persistence_across_turns(tmp_path)`: Test persistence across turns
+- `test_get_history_bleed_stats_basic()`: Test history bleed stats
+
+**Coverage:** History persistence verification
+
+##### test_log_management_ui.py (3.2KB)
+**Purpose:** Test log management UI
+
+**Fixture:**
+- `mock_config(tmp_path)`: Mock config
+- `mock_project(tmp_path)`: Mock project
+- `app_instance(mock_config, mock_project, monkeypatch)`: App instance
+
+**Tests:**
+- `test_log_management_init(app_instance)`: Test log management initialization
+- `test_render_log_management_logic(app_instance)`: Test render logic
+
+**Coverage:** Log UI verification
+
+##### test_log_pruner.py (2.3KB)
+**Purpose:** Test log pruning
+
+**Fixture:**
+- `pruner_setup(tmp_path)`: Tuple of LogPruner, LogRegistry, Path
+
+**Tests:**
+- `test_prune_old_insignificant_logs(pruner_setup)`: Test pruning old insignificant logs
+
+**Coverage:** Log pruning verification
+
+##### test_log_registry.py (8.6KB)
+**Purpose:** Test log registry
+
+**Test Class:** `TestLogRegistry(unittest.TestCase)`
+
+**Tests:**
+-
`test_instantiation()`: Test LogRegistry instantiation
+- `test_register_session()`: Test session registration
+- `test_update_session_metadata()`: Test metadata update
+- `test_is_session_whitelisted()`: Test whitelist checking
+- `test_get_old_non_whitelisted_sessions()`: Test retrieval of old sessions
+
+**Coverage:** Registry functionality verification
+
+##### test_logging_e2e.py (3KB)
+**Purpose:** Test end-to-end logging
+
+**Fixture:**
+- `e2e_setup(tmp_path, monkeypatch)`: Setup for e2e test
+
+**Tests:**
+- `test_logging_e2e(e2e_setup)`: Test full logging e2e
+
+**Coverage:** Logging e2e verification
+
+##### test_session_logging.py (2.2KB)
+**Purpose:** Test session logging
+
+**Fixture:**
+- `temp_logs(tmp_path, monkeypatch)`: Temporary logs path
+
+**Tests:**
+- `test_open_session_creates_subdir_and_registry(temp_logs)`: Test session directory and registry creation
+
+**Coverage:** Session logging verification
+
+##### test_process_pending_gui_tasks.py (2.4KB)
+**Purpose:** Test process pending GUI tasks
+
+**Fixture:**
+- `app_instance()`: App instance
+
+**Tests:**
+- `test_redundant_calls_in_process_pending_gui_tasks(app_instance)`: Test no redundant calls
+- `test_gcli_path_updates_adapter(app_instance)`: Test gcli_path updates adapter
+
+**Coverage:** Pending tasks processing verification
+
+##### test_project_manager_tracks.py (2.7KB)
+**Purpose:** Test project manager tracks
+
+**Tests:**
+- `test_get_all_tracks_empty(tmp_path)`: Test get_all_tracks with empty directory
+- `test_get_all_tracks_with_state(tmp_path)`: Test get_all_tracks with state
+- `test_get_all_tracks_with_metadata_json(tmp_path)`: Test get_all_tracks with metadata
+- `test_get_all_tracks_malformed(tmp_path)`: Test get_all_tracks with malformed data
+
+**Coverage:** Project manager tracks verification
+
+##### test_track_state_persistence.py (3.1KB)
+**Purpose:** Test track state persistence
+
+**Tests:**
+- `test_track_state_persistence(tmp_path)`: Test save/load TrackState
+
+**Coverage:** Track state persistence verification
+
+##### test_track_state_schema.py (6.1KB)
+**Purpose:** Test track state schema
+
+**Tests:**
+- `test_track_state_instantiation()`: Test TrackState instantiation
+- `test_track_state_to_dict()`: Test to_dict() method
+- `test_track_state_from_dict()`: Test from_dict() class method
+- `test_track_state_from_dict_empty_and_missing()`: Test from_dict with empty and missing values
+- `test_track_state_to_dict_with_none()`: Test to_dict with None values
+
+**Coverage:** Track state schema verification
+
+##### test_tree_sitter_setup.py (860 bytes)
+**Purpose:** Test tree-sitter setup
+
+**Tests:**
+- `test_tree_sitter_python_setup()`: Test tree-sitter and tree-sitter-python installed and can parse
+
+**Coverage:** Tree-sitter installation verification
+
+##### test_vlogger_availability.py (341 bytes)
+**Purpose:** Test VerificationLogger availability
+
+**Tests:**
+- `test_vlogger_available(vlogger)`: Test VerificationLogger available
+
+**Coverage:** Fixture verification
+
+##### test_ai_style_formatter.py (2.8KB)
+**Purpose:** Test AI style formatter
+
+**Tests:**
+- `test_basic_indentation()`: Test 1-space indentation
+- `test_top_level_blank_lines()`: Test max one blank line between top-level definitions
+- `test_inner_blank_lines()`: Test zero blank lines within function bodies
+- `test_multiline_string_safety()`: Test multiline string safety
+- `test_continuation_indentation()`: Test continuation indentation
+- `test_multiple_top_level_definitions(vlogger)`: Test multiple top-level definitions
+
+**Coverage:** Style formatting verification
+
+## Part 8: Deep Testing & Simulation Architecture Analysis
+
+### Gap 1: No Real-Time Latency Simulation
+
+**Current Implementation:**
+```python
+# simulation/sim_base.py
+time.sleep(random.uniform(0.5, 2.0))  # Fixed delays
+```
+
+**What's Missing:**
+- Variable LLM latency (1-10s)
+- Network latency (100-500ms per request)
+- UI rendering time (16-33ms per frame)
+- Database I/O variance
+
+**Impact:**
+- Tests don't catch timeout issues under real latency
+- Tests don't verify streaming UX (chunked text display)
+- Tests don't measure actual perceived performance
+
+### Gap 2: No Human-Like Behavior Simulation
+
+**Current Implementation:**
+```python
+# simulation/user_agent.py
+class UserSimAgent:
+    def generate_response(self, conversation_history):
+        # Simple heuristic-based responses
+        ...
+
+    def perform_action_with_delay(self, action_func):
+        # Execute action with human-like delay
+        ...
+```
+
+**What's Missing:**
+- Typing speed (50-200 WPM with variability)
+- Hesitation before actions
+- Mistakes (wrong button clicks, editing errors)
+- Reading time for approval dialogs
+- Task switching (window switching, getting distracted)
+
+**Impact:**
+- Tests don't catch UI issues from rapid or erroneous user input
+- Tests don't verify users can recover from mistakes
+- Tests don't measure actual task completion time
+
+### Gap 3: Arbitrary Polling Intervals Miss Transient States
+
+**Current
Implementation:**
+```python
+# All simulation tests
+for _ in range(60):  # 1-second polls
+    if condition_met():
+        break
+    time.sleep(1)
+```
+
+**Problem:** States that exist for <1 second are never observed.
+
+**Examples of Missed States:**
+- Loading spinner (200-500ms duration)
+- Flickering indicator (50-100ms)
+- Transient error message (300ms duration)
+- Partial state transitions between A and B
+
+**Impact:**
+- UI glitches and race conditions that only manifest briefly are never caught.
+
+### Gap 4: Mock CLI Redirection - Subprocess Bypass
+
+**Current Implementation:**
+```python
+# All integration tests
+client.set_value('gcli_path', mock_cli_path)
+```
+
+**What's Not Tested:**
+- Real subprocess spawning issues (PATH problems, permission errors)
+- Environment variable passing
+- CLI argument parsing and validation
+- Stdin/stderr handling
+- Process cleanup on failure
+
+**Impact:** Real subprocess integration bugs are never caught.
+
+### Gap 5: State Verification is Shallow
+
+**Current Implementation:**
+```python
+# From test_visual_sim_mma_v2.py
+status = client.get_mma_status()
+tickets = status.get('active_tickets', [])
+assert len(tickets) >= 2  # Only checks length
+
+streams = status.get('mma_streams', {})
+if "SUCCESS: Mock Tier 3 worker" in streams[tier3_key]:
+    streams_found = True  # Only checks substring
+```
+
+**What's Not Verified:**
+- Is the ticket structure valid?
+- Are the IDs unique?
+- Are the dependencies correct?
+- Is the stream content complete?
+- Are there errors in the output?
+
+**Impact:** Data integrity issues are never caught.
+
+### Gap 6: No Stress Testing
+
+**Current Coverage:** None
+
+**Missing Tests:**
+- Load testing with many concurrent requests
+- Edge case bombardment (rapid user input, malformed data)
+- Performance under resource constraints
+
+**Impact:** Resource leaks and race conditions are never caught.
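+The coarse 1-second polling pattern above can be tightened with a deadline-based helper that samples at a much finer interval, making sub-second states observable and letting waits exit as soon as the condition holds. A minimal sketch — the `wait_for` name, its defaults, and the commented usage line are illustrative assumptions, not existing project API:
+
+```python
+import time
+
+def wait_for(condition, timeout=10.0, interval=0.05):
+    """Poll `condition` every `interval` seconds until it returns a truthy
+    value or `timeout` elapses. Returns the truthy value, or None on timeout."""
+    deadline = time.monotonic() + timeout
+    while time.monotonic() < deadline:
+        result = condition()
+        if result:
+            return result
+        time.sleep(interval)
+    return None
+
+# Hypothetical usage against the hook client described in this report:
+# tickets = wait_for(lambda: client.get_mma_status().get('active_tickets'))
+```
+
+At a 50ms interval, a 200-500ms loading spinner is sampled several times instead of being skipped entirely, and successful waits cost at most one interval of overshoot rather than a full second.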
+
+## Part 9: Summary Table: Testing Pitfalls by Severity
+
+| Severity | Issue | Files Affected |
+|----------|-------|----------------|
+| **HIGH** | Mock provider always returns success | All integration tests using mock_gemini_cli.py |
+| **HIGH** | Auto-approval of all HITL gates | test_visual_sim_mma_v2.py, all simulation tests |
+| **HIGH** | Substring-based assertions | All visual and MMA tests |
+| **HIGH** | State existence only, no validation | All MMA and conductor tests |
+| **HIGH** | No negative path testing | Entire test suite |
+| **HIGH** | No state machine validation | All MMA and conductor tests |
+| **HIGH** | No concurrent access testing | Entire test suite |
+| **MEDIUM** | No visual verification | All integration and visual tests |
+| **MEDIUM** | Arbitrary polling intervals miss transient states | All polling tests |
+| **MEDIUM** | Mock CLI bypasses subprocess | All integration tests |
+| **MEDIUM** | Timeout-based testing brittle | All polling tests |
+| **MEDIUM** | State pollution between tests | Tests using reset_ai_client fixture |
+| **MEDIUM** | No visual rendering | All visual tests |
+| **MEDIUM** | No stress testing | Entire test suite |
+| **LOW** | No real-time latency simulation | Simulation tests |
+| **LOW** | No human-like behavior | Simulation tests |
+
+**Total High Priority Issues:** 7
+
+**Total Medium Priority Issues:** 7
+
+**Total Low Priority Issues:** 2
+
+---
+
+## Part 10: Architecture Strengths & Design Patterns Observed
+
+### Strengths
+
+1. **Clear Layering:**
+   - 4-tier MMA with explicit boundaries (Tier 1 PM → Tier 2 Tech Lead → Tier 3 Worker → Tier 4 QA)
+   - Tier roles clearly defined with responsibilities
+   - State transitions managed via ConductorEngine
+   - All tiers use the same ai_client interface
+
+2. **Decoupled Events:**
+   - EventEmitter for synchronous pub/sub
+   - AsyncEventQueue for cross-thread async communication
+   - Clear separation between GUI thread and asyncio worker thread
+
+3. **HITL Enforcement:**
+   - MUTATING_TOOLS frozenset for identifying dangerous tools
+   - pre_tool_callback routing for user approval
+   - Approval dialogs (ConfirmDialog, MMAApprovalDialog, MMASpawnApprovalDialog) for all destructive actions
+
+4. **Token Management:**
+   - History truncation to prevent token bloat
+   - Tier-scoped context (build_tier[1-3]_context)
+   - Cache TTL for Gemini (3600s, rebuilt at 90%)
+   - Token usage tracking per tier
+
+5. **Multi-Provider Support:**
+   - Unified ai_client.send() interface
+   - Supports Gemini, Anthropic, DeepSeek, Gemini CLI
+   - Provider-specific optimizations (Gemini server-side caching, Anthropic prompt caching)
+
+6. **Comprehensive Testing Infrastructure:**
+   - live_gui fixture for session-scoped GUI lifecycle
+   - kill_process_tree() for clean process cleanup
+   - VerificationLogger for structured test telemetry
+   - Artifact isolation (tests/artifacts/, tests/logs/)
+
+7. **Data-Oriented Design:**
+   - Minimal use of OOP, preference for data structures and functions
+   - Performance-oriented architecture (PerformanceMonitor, frame-sync loops)
+
+### Design Patterns
+
+1. **Observer Pattern:**
+   - EventEmitter for lifecycle events
+   - Listeners registered via events.on()
+
+2. **Strategy Pattern:**
+   - Multi-provider AI client with provider-specific methods
+   - _send_gemini(), _send_anthropic(), _send_deepseek(), _send_gemini_cli()
+
+3. **Factory Pattern:**
+   - generate_tickets() creates Ticket objects
+   - generate_tracks() creates Track objects
+   - multi_agent_conductor.run_worker_lifecycle() spawns worker context
+
+4.
**Command Pattern:**
+   - Action map in _process_pending_gui_tasks() for GUI state mutations
+   - Actions triggered by string keys (set_value, click, select_tab, etc.)
+
+5. **Dependency Injection:**
+   - confirm_and_run_callback, comms_log_callback, tool_log_callback injected by GUI
+   - qa_callback injected for Tier 4 QA
+
+6. **Template Method:**
+   - BaseSimulation for all simulation classes
+   - WorkflowSimulator extends BaseSimulation
+
+### Potential Concerns
+
+1. **Large Monolithic Files:**
+   - gui_2.py (77.6KB) - Main GUI orchestrator
+   - ai_client.py (70.6KB) - Multi-provider abstraction
+   - app_controller.py (70.1KB) - Headless controller
+   - mcp_client.py (48.2KB) - 26-tool dispatcher
+   - Risk: Difficult to navigate, high maintenance burden
+
+2. **State Duplication:**
+   - App (GUI) and AppController (headless) share logic
+   - Many similar methods between the two classes
+   - Violates DRY principle
+
+3. **Global State in Modules:**
+   - ai_client.py uses module-level globals for provider state
+   - Makes testing difficult, prone to state pollution
+   - Hard to reason about thread-safety
+
+4. **MCP Client Complexity:**
+   - 26 tools in a single file (mcp_client.py)
+   - Could be grouped by domain (file ops, Python analysis, web tools)
+   - Makes file navigation difficult
+
+5.
**Context Amnesia Enforcement:**
+   - Relies on manual ai_client.reset_session() calls
+   - No automation to prevent accidental state pollution
+   - Risk: Workers might accumulate state across tickets
+
+## Part 11: Cross-Reference with Existing Tracks
+
+### test_stabilization_20260302
+- Overlaps: None
+- This track addresses asyncio errors and the mock-rot ban
+- Our audit found: Mock-rot is already structurally banned but enforcement is weak
+- Synergy: This audit identifies specific weaknesses in the mock provider that the stabilization track should address
+
+### codebase_migration_20260302
+- Overlaps: None
+- This track restructures to a src/ layout
+- Our audit focuses on testing infrastructure, not directory structure
+- Synergy: Directory restructuring should happen AFTER testing is hardened
+
+### gui_decoupling_controller_20260302
+- Overlaps: None
+- This track extracts the state machine from the GUI
+- Our audit finds: State duplication between App and AppController
+- Synergy: Decoupling should include test infrastructure hardening
+
+### hook_api_ui_state_verification_20260302
+- Overlaps: None
+- This track adds a /api/gui/state GET endpoint
+- Our audit recommends: All tests should use the hook server for state verification
+- Synergy: High - this enables the automated testing our audit recommends
+
+### robust_json_parsing_tech_lead_20260302
+- Overlaps: None
+- This track adds auto-retry for JSON parsing
+- Our audit found: The mock provider never produces malformed JSON
+- Synergy: Auto-retry won't help if the mock always succeeds
+
+### concurrent_tier_source_tier_20260302
+- Overlaps: None
+- This track uses threading.local() for thread-safe logging
+- Our audit found: No concurrent access tests
+- Synergy: High - the threading.local() implementation should include comprehensive testing
+
+### test_suite_performance_and_flakiness_20260302
+- Overlaps: High
+- This track replaces time.sleep() with deterministic polling
+- Our audit identified: Arbitrary timeouts make tests brittle
+- Synergy: High - our audit recommends eliminating arbitrary sleeps
+
+### manual_ux_validation_20260302
+- Overlaps: None
+- This track validates GUI UX via simulation feedback
+- Our audit found: Simulations are low-fidelity emulators
+- Synergy: This track depends on the simulation framework being improved
+
+## Part 12: Recommendations for Future Tracks
+
+### Priority 1: Fix Mock Provider (HIGH)
+
+**Suggested Track:** "mock_provider_enhancement_20260305"
+
+**Goals:**
+- Add failure modes (timeouts, malformed JSON, rate limits)
+- Add input validation
+- Track tool calls for verification
+- Make mock responses configurable via environment variables
+
+**Files to Modify:**
+- tests/mock_gemini_cli.py
+
+### Priority 2: Fix Auto-Approval Pattern (HIGH)
+
+**Suggested Track:** "approval_ux_enhancement_20260305"
+
+**Goals:**
+- Remove auto-approval from critical path tests
+- Add dialog visibility verification before clicking
+- Add rejection flow tests for all approval types
+- Test approval fatigue scenarios
+
+**Files to Modify:**
+- tests/test_visual_sim_mma_v2.py
+- All simulation tests with approval flows
+
+### Priority 3: Add Negative Testing (HIGH)
+
+**Suggested Track:** "negative_path_testing_20260305"
+
+**Goals:**
+- Create a comprehensive negative test suite
+- Test all rejection flows
+- Test error handling (timeouts, network failures, malformed data)
+- Test concurrent access patterns
+- Test out-of-order event sequences
+
+**Files to Modify:**
+- Create tests/test_negative_flows.py
+- Update mock_gemini_cli.py
+
+### Priority 4: Add State Validation (MEDIUM)
+
+**Suggested Track:** "state_validation_enhancement_20260305"
+
+**Goals:**
+- Add schema validation using pydantic
+- Add state machine invariants testing
+- Add thread-safety tests for shared resources
+- Add DAG integrity tests (cycles, self-dependencies)
+
+**Files to Modify:**
+- Create tests/test_schemas.py
+- Update all
MMA and conductor tests
+
+### Priority 5: Add Visual Verification (MEDIUM)
+
+**Suggested Track:** "visual_regression_testing_20260305"
+
+**Goals:**
+- Add screenshot comparison infrastructure
+- Test modal dialog visibility
+- Test text overflow and clipping
+- Test layout at different window sizes
+
+**Files to Modify:**
+- Create tests/test_visual_regression.py
+- Create tests/baselines/
+- Update all visual and MMA tests
+
+### Priority 6: Improve Simulation Fidelity (LOW)
+
+**Suggested Track:** "simulation_fidelity_enhancement_20260305"
+
+**Goals:**
+- Add variable latency simulation
+- Add human-like behavior (typing speed, hesitation, mistakes)
+- Add realistic delays (not fixed random values)
+- Add task switching and distraction simulation
+
+**Files to Modify:**
+- simulation/user_agent.py
+- All simulation files
+
+### Priority 7: Consolidate Test Infrastructure (MEDIUM)
+
+**Suggested Track:** "test_infrastructure_consolidation_20260305"
+
+**Goals:**
+- Centralize common test patterns (polling helpers, verification helpers)
+- Improve fixture cleanup to prevent state pollution
+- Add better error reporting and diagnostics
+- Standardize mock patterns across all test suites
+
+**Files to Modify:**
+- tests/conftest.py
+- Create shared helpers in conftest.py
+- Update all tests to use the new helpers
+
+### Medium-Term Improvements
+
+#### Recommendation 8: Update Structural Testing Contract (MEDIUM)
+
+**Description:** Add missing rules to docs/guide_simulations.md
+
+**New Rules to Add:**
+1. Every approval dialog type must have tests for the rejection flow
+2. All async operations must have timeout/failure tests
+3. Every parser must have malformed input tests
+4. State validation beyond existence checks required
+5. Visual verification required for modal dialogs
+6. Thread-safety testing required for shared resources
+
+**Files to Modify:**
+- docs/guide_simulations.md
+
+#### Recommendation 9: Add Property-Based Testing (MEDIUM)
+
+**Description:** Use Hypothesis for generative testing
+
+**Files to Modify:**
+- tests/test_properties.py
+- Add to requirements: `hypothesis` package
+
+#### Recommendation 10: Add Fuzzing (MEDIUM)
+
+**Description:** Add fuzzing for robustness testing
+
+**Files to Modify:**
+- tests/test_fuzzing.py
+
+### Long-Term Architecture
+
+#### Recommendation 11: Adopt Test-Driven Development
+
+**Description:** Move the existing test suite toward TDD methodology
+
+**Current State:** Many tests written after implementation
+
+**Goal:** Write failing tests first, then implement to make them pass
+
+**Benefits:**
+- Ensures code quality
+- Improves test reliability
+- Makes refactoring safe
+
+#### Recommendation 12: Separate Unit and Integration Tests
+
+**Description:** Ensure unit tests don't depend on the live_gui fixture
+
+**Current Issue:** Many unit tests use live_gui, making them slow and flaky
+
+**Goal:** Isolate unit logic, use mocks for unit tests, reserve live_gui for true integration tests
+
+**Benefits:**
+- Faster unit test execution
+- Clear separation of concerns
+- More reliable unit test results
+
+---
+
+## Part 13: Conclusion
+
+### Summary of Findings
+
+This audit has revealed **6 critical false positive risks** and **6 major simulation fidelity gaps** that significantly undermine confidence in the test suite:
+
+**Critical False Positive Risks:**
+1. Mock provider always succeeds → error handling untested
+2. Auto-approval never verifies dialogs → approval UX untested
+3. Substring assertions only check existence → data integrity untested
+4. State existence only, no validation → invariants untested
+5. No negative testing → rejection flows, errors, edge cases untested
+6. No visual verification → rendering bugs never caught
+
+**Major Simulation Gaps:**
+1. No real-time latency simulation
+2.
No human-like behavior simulation
+3. Arbitrary polling intervals miss transient states
+4. Mock CLI redirection bypasses subprocess
+5. No stress testing
+
+**Impact Assessment:**
+- The test suite provides **good architectural boundary checking** but suffers from **critical gaps in error handling, state validation, and UX verification**
+- The simulation framework is a **rough emulator** that tests the happy path only and masks many real-world failure scenarios
+- Existing tracks (stabilization, migration, decoupling) will benefit from this audit's findings
+
+### Next Steps
+
+**Immediate:**
+1. Review this report with project stakeholders
+2. Prioritize recommendations based on severity and impact
+3. Plan implementation tracks for the highest priority issues
+
+**For Future Sessions:**
+1. Use this report as a reference when planning new tracks
+2. Ensure new tracks address the false positive risks identified here
+3. Improve the simulation framework before depending on it for critical tests
+
+---
+
+**Report Generation Complete**
+
+This report was generated by GLM-4.7 through comprehensive skeletal analysis of: the entire codebase (src/, tests/, simulation/), architecture documentation review, and pattern-based identification of testing anti-patterns and simulation gaps.
+ +The report contains 13 major sections covering: +- Full module architecture for src/ (26 modules analyzed) +- Complete test architecture for tests/ (100+ test files analyzed) +- Simulation framework analysis (9 scripts analyzed) +- Deep analysis of 7 critical false positive risks with concrete examples +- Deep analysis of 5 major simulation fidelity gaps +- Specific test file analysis with code examples +- 12 prioritized recommendations +- Architecture strengths, design patterns, and potential concerns +- Cross-reference with existing tracks +- Summary table by severity (7 HIGH priority issues, 10 medium priority issues) +- Long-term architecture recommendations +- Conclusion and next steps + +This exhaustive detail should enable future agents to fully understand findings without needing to re-analyze the codebase. diff --git a/conductor/tracks/test_architecture_integrity_audit_20260304/report_claude.md b/conductor/tracks/test_architecture_integrity_audit_20260304/report_claude.md new file mode 100644 index 0000000..4e9e0ae --- /dev/null +++ b/conductor/tracks/test_architecture_integrity_audit_20260304/report_claude.md @@ -0,0 +1,562 @@ +# Test Architecture Integrity Audit — Claude Review + +**Author:** Claude Sonnet 4.6 (Tier 1 Orchestrator) +**Review Date:** 2026-03-05 +**Source Report:** report.md (authored by GLM-4.7, 2026-03-04) +**Scope:** Verify GLM's findings, correct errors, surface missed issues, produce actionable +recommendations for downstream tracks. + +**Methodology:** +1. Read all 6 `docs/` architecture guides (guide_architecture, guide_simulations, guide_tools, + guide_mma, guide_meta_boundary, Readme) +2. Read GLM's full report.md +3. Read plan.md and spec.md for this track +4. Read py_get_skeleton for all 27 src/ modules +5. Read py_get_skeleton for conftest.py and representative test files + (test_extended_sims, test_live_gui_integration, test_dag_engine, + test_mma_orchestration_gui) +6. Read py_get_skeleton for all 9 simulation/ modules +7. 
Cross-referenced findings against JOURNAL.md, TASKS.md, and git history + +--- + +## Section 1: Verdict on GLM's Report + +GLM produced a competent surface-level audit. The structural inventory is +accurate and the broad categories of weakness (mock-rot, shallow assertions, +no negative paths) are valid. However, the report has material errors in +severity classification, contains two exact duplicate sections (Parts 10 and +11 are identical), and misses several issues that are more impactful than +the ones it flags at HIGH. It also makes recommendations that are +architecturally inappropriate for an ImGui immediate-mode application. + +**Confirmed correct:** ~60% of findings +**Overstated or miscategorized:** ~25% of findings +**Missed entirely:** see Section 3 + +--- + +## Section 2: GLM Findings — Confirmed, Corrected, or Rejected + +### 2.1 Confirmed: Mock Provider Never Fails (HIGH) + +GLM is correct. `tests/mock_gemini_cli.py` has zero failure modes. The +keyword routing (`'"PATH: Epic Initialization"'`, `'"PATH: Sprint Planning"'`, +default) always produces a well-formed success response. No test using this +mock can ever exercise: +- Malformed or truncated JSON-L output +- Non-zero exit code from the CLI process +- A `{"type": "result", "status": "error", ...}` result event +- Rate-limit or quota responses +- Partial output followed by process crash + +The `GeminiCliAdapter.send()` parses streaming JSON-L line-by-line. A +corrupted line (encoding error, mid-write crash) would throw a `json.JSONDecodeError` +that bubbles up through `_send_gemini_cli`. This path is entirely untested. + +**Severity: HIGH — confirmed.** + +### 2.2 Confirmed: Auto-Approval Hides Dialog Logic (MEDIUM, not HIGH) + +GLM flags this as HIGH. The auto-approval pattern in polling loops is: +```python +if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn') +``` + +This is structurally correct for automated testing — you MUST auto-approve +to drive the pipeline. 
The actual bug is different from what GLM describes:
+the tests never assert that the dialog appeared BEFORE approving. The
+correct pattern is:
+```python
+assert status.get('pending_mma_spawn_approval'), "Spawn dialog never appeared"
+client.click('btn_approve_spawn')
+```
+
+Without the assert, the test passes even if the dialog never fires (meaning
+spawn approval is silently bypassed at the application level).
+
+**Severity: MEDIUM (dialog verification gap, not approval mechanism itself).**
+**GLM's proposed fix ("Remove auto-approval") is wrong.** Auto-approval is
+required for unattended testing. The fix is to assert the flag is True
+*before* clicking.
+
+There is also zero testing of the rejection path: what happens when
+`btn_reject_spawn` is clicked? Does the engine stop? Does it log an error?
+Does the track reach "blocked" state? This is an untested state transition.
+
+### 2.3 Confirmed: Assertions Are Shallow (HIGH)
+
+GLM is correct. The two canonical examples from simulation tests:
+```python
+assert len(tickets) >= 2  # structure unknown
+assert "SUCCESS: Mock Tier 3 worker" in streams[tier3_key]  # substring only
+```
+
+Neither validates ticket schema, ID uniqueness, dependency correctness, or
+that the stream content is actually the full response and not a truncated
+fragment.
+
+**Severity: HIGH — confirmed.**
+
+### 2.4 Confirmed: No Negative Path Testing (HIGH)
+
+GLM is correct. The entire test suite covers only the happy path.
Missing: +- Rejection flows for all three dialog types (ConfirmDialog, MMAApprovalDialog, + MMASpawnApprovalDialog) +- Malformed LLM response handling (bad JSON, missing fields, unexpected types) +- Network timeout/connection error to Hook API during a live_gui test +- `shell_runner.run_powershell` timeout (60s) expiry path +- `mcp_client._resolve_and_check` returning an error (path outside allowlist) + +**Severity: HIGH — confirmed.** + +### 2.5 Confirmed: Arbitrary Poll Intervals Miss Transient States (MEDIUM) + +GLM is correct. 1-second polling in simulation loops will miss any state +that exists for less than 1 second. The approval dialogs in particular may +appear and be cleared within a single render frame if the engine is fast. + +The `WorkflowSimulator.wait_for_ai_response()` method is the most critical +polling target. It is the backbone of all extended simulation tests. If its +polling strategy is wrong, the entire extended sim suite is unreliable. + +**Severity: MEDIUM — confirmed.** + +### 2.6 Confirmed: Mock CLI Bypasses Real Subprocess Path (MEDIUM) + +GLM is correct. Setting `gcli_path` to a Python script does not exercise: +- Real PATH resolution for the `gemini` binary +- Windows process group creation (`CREATE_NEW_PROCESS_GROUP`) +- Environment variable propagation to the subprocess +- `mcp_env.toml` path prepending (in `shell_runner._build_subprocess_env`) +- The `kill_process_tree` teardown path when the process hangs + +**Severity: MEDIUM — confirmed.** + +### 2.7 CORRECTION: "run_powershell is a Read-Only Tool" + +**GLM is WRONG here.** In Part 8, GLM lists: +> "Read-Only Tools: run_powershell (via shell_runner.py)" + +`run_powershell` executes arbitrary PowerShell scripts against the filesystem. +It is the MOST dangerous tool in the set — it is not in `MUTATING_TOOLS` only +because it is not an MCP filesystem tool; its approval gate is the +`confirm_and_run_callback` (ConfirmDialog). 
Categorizing it as "read-only" +is a factual error that could mislead future workers about the security model. + +### 2.8 CORRECTION: "State Duplication Between App and AppController" + +**GLM is outdated here.** The gui_decoupling track (`1bc4205`) was completed +before this audit. `gui_2.App` now delegates all state through `AppController` +via `__getattr__`/`__setattr__` proxies. There is no duplication — `App` is a +thin ImGui rendering layer, `AppController` owns all state. GLM's concern is +stale relative to the current codebase. + +### 2.9 CORRECTION: "Priority 5 — Screenshot Comparison Infrastructure" + +**This recommendation is architecturally inappropriate** for Dear PyGui/ImGui. +These are immediate-mode renderers; there is no DOM or widget tree to +interrogate. Pixel-level screenshot comparison requires platform-specific +capture APIs (Windows Magnification, GDI) and is extremely fragile to font +rendering, DPI, and GPU differences. The Hook API's logical state verification +is the CORRECT and SUFFICIENT abstraction for this application. Adding +screenshot comparison would be high cost, low value, and high flakiness. + +The appropriate alternative (already partially in place via `hook_api_ui_state_verification_20260302`) +is exposing more GUI state via the Hook API so tests can assert logical +rendering state (is a panel visible? what is the modal title?) without pixels. + +### 2.10 CORRECTION: Severity Table Has Duplicate and Conflicting Entries + +The summary table in Part 9 lists identical items at multiple severity levels: +- "No concurrent access testing": appears as both HIGH and MEDIUM +- "No real-time latency simulation": appears as both MEDIUM and LOW +- "No human-like behavior": appears as both MEDIUM and LOW +- "Arbitrary polling intervals": appears as both MEDIUM and LOW + +Additionally, Parts 10 and 11 are EXACTLY IDENTICAL — the cross-reference +section was copy-pasted in full. 
This suggests the report was generated with +insufficient self-review. + +### 2.11 CONTEXTUAL DOWNGRADE: Human-Like Behavior / Latency Simulation + +GLM spends substantial space on the absence of: +- Typing speed simulation +- Hesitation before actions +- Variable LLM latency + +This is a **personal developer tool for a single user on a local machine**. +These are aspirational concerns for a production SaaS simulation framework. +For this product context, these are genuinely LOW priority. The simulation +framework's job is to verify that the GUI state machine transitions correctly, +not to simulate human psychology. + +--- + +## Section 3: Issues GLM Missed + +These are findings not present in GLM's report that carry meaningful risk. + +### 3.1 CRITICAL: `live_gui` is Session-Scoped — Dirty State Across Tests + +`conftest.py`'s `live_gui` fixture has `scope="session"`. This means ALL +tests that use `live_gui` share a single running GUI process. If test A +leaves the GUI in a state with an open modal dialog, test B will find the +GUI unresponsive or in an unexpected state. + +The teardown calls `client.reset_session()` (which clicks `btn_reset_session`), +but this clears AI state and discussion history, not pending dialogs or +MMA orchestration state. A test that triggers a spawn approval dialog and +then fails before approving it will leave `_pending_mma_spawn` set, blocking +the ENTIRE remaining test session. + +**Severity: HIGH.** The current test ordering dependency is invisible and +fragile. Tests must not be run in arbitrary order. + +**Fix:** Each `live_gui`-using test that touches MMA or approval flows should +explicitly verify clean state at start: +```python +status = client.get_mma_status() +assert not status.get('pending_mma_spawn_approval'), "Previous test left GUI dirty" +``` + +### 3.2 HIGH: `app_instance` Fixture Tests Don't Test Rendering + +The `app_instance` fixture mocks out all ImGui rendering. 
This means every +test using `app_instance` (approximately 40+ tests) is testing Python object +state, not rendered UI. Tests like: +- `test_app_has_render_token_budget_panel(app_instance)` — tests `hasattr()`, + not that the panel renders +- `test_render_token_budget_panel_empty_stats_no_crash(app_instance)` — calls + `_render_token_budget_panel()` in a context where all ImGui calls are no-ops + +This creates a systematic false-positive class: a method can be completely +broken (wrong data, missing widget calls) and the test passes because ImGui +calls are silently ignored. The only tests with genuine rendering fidelity +are the `live_gui` tests. + +This is the root cause behind GLM's "state existence only" finding. It is +not a test assertion weakness — it is a fixture architectural limitation. + +**Severity: HIGH.** The implication: all `app_instance`-based rendering +tests should be treated as "smoke tests that the method doesn't crash," +not as "verification that the rendering is correct." + +**Fix:** The `hook_api_ui_state_verification_20260302` track (adding +`/api/gui/state`) is the correct path forward: expose render-visible state +through the Hook API so `live_gui` tests can verify it. + +### 3.3 HIGH: No Test for `ConfirmDialog.wait()` Infinite Block + +`ConfirmDialog.wait()` uses `_condition.wait(timeout=0.1)` in a `while not self._done` loop. +There is no outer timeout on this loop. If the GUI thread never signals the +dialog (e.g., GUI crash after dialog creation, or a test that creates a +dialog but doesn't render it), the asyncio worker thread hangs indefinitely. + +This is particularly dangerous in the `run_worker_lifecycle` path: +1. Worker pushes dialog to event queue +2. GUI process crashes or freezes +3. `dialog.wait()` loops forever at 0.1s intervals +4. Test session hangs with no error output + +There is no test verifying that `wait()` has a maximum wait time and raises +an exception or returns a default (rejected) decision after it. 
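A bounded `wait()` — a sketch built only from this report's description of the inner loop (`_condition`, `_done`, the 0.1s poll), not taken from the project's actual `ConfirmDialog` — shows the contract such a test would pin down: the loop carries an outer deadline and falls back to a rejected decision instead of spinning forever.

```python
import threading
import time

class BoundedConfirmDialog:
    """Illustrative sketch, not the project's ConfirmDialog: same
    0.1s inner poll, but with an outer deadline on wait()."""

    def __init__(self) -> None:
        self._condition = threading.Condition()
        self._done = False
        self._approved = False

    def signal(self, approved: bool) -> None:
        # Called from the GUI thread when the user clicks a button.
        with self._condition:
            self._approved = approved
            self._done = True
            self._condition.notify_all()

    def wait(self, max_wait: float = 30.0) -> bool:
        # Bounded version of the `while not self._done` loop: after
        # max_wait seconds the decision defaults to rejected.
        deadline = time.monotonic() + max_wait
        with self._condition:
            while not self._done:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False  # timeout == rejection, never a hang
                self._condition.wait(timeout=min(0.1, remaining))
            return self._approved
```

The missing test then reduces to: construct the dialog, never signal it, and assert that `wait()` returns a rejected decision within the deadline instead of blocking.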
+ +**Severity: HIGH.** + +### 3.4 MEDIUM: `mcp_client` Module State Persists Across Unit Tests + +`mcp_client.configure()` sets module-level globals (`_allowed_paths`, +`_base_dirs`, `_primary_base_dir`). Tests that call MCP tool functions +directly without calling `configure()` first will use whatever state was +left from the previous test. The `reset_ai_client` autouse fixture calls +`ai_client.reset_session()` but does NOT reset `mcp_client` state. + +Any test that calls `mcp_client.read_file()`, `mcp_client.py_get_skeleton()`, +etc. directly (not through `ai_client.send()`) inherits the allowlist from +the previous test run. This can cause false passes (path permitted by +previous test's allowlist) or false failures (path denied because +`_base_dirs` is empty from a prior reset). + +**Severity: MEDIUM.** + +### 3.5 MEDIUM: `current_tier` Module Global — No Test for Concurrent Corruption + +GLM mentions this as a "design concern." It is more specific: the +`concurrent_tier_source_tier_20260302` track exists because `current_tier` +in `ai_client.py` is a module-level `str | None`. When two Tier 3 workers +run concurrently (future feature), the second `send()` call will overwrite +the first worker's tier tag. + +What's missing: there is no test that verifies the CURRENT behavior is safe +under single-threaded operation, and no test that demonstrates the failure +mode under concurrent operation to serve as a regression baseline for the fix. + +**Severity: MEDIUM.** + +### 3.6 MEDIUM: `test_arch_boundary_phase2.py` Tests Config File, Not Runtime + +The arch boundary tests verify that `manual_slop.toml` lists mutating tools +as disabled by default. But the tests don't verify: +1. That `manual_slop.toml` is actually loaded into `ai_client._agent_tools` + at startup +2. That `ai_client._agent_tools` is actually consulted before tool dispatch +3. That the TOML → runtime path is end-to-end + +A developer could modify how tools are loaded without breaking these tests. 
+The tests are static config audits, not runtime enforcement tests. + +**Severity: MEDIUM.** + +### 3.7 MEDIUM: `UserSimAgent.generate_response()` Calls `ai_client.send()` Directly + +From `simulation/user_agent.py`: the `UserSimAgent` class imports `ai_client` +and calls `ai_client.send()` to generate "human-like" responses. This means: +- Simulation tests have an implicit dependency on a configured LLM provider +- If run without an API key (e.g., in CI), simulations fail at the UserSimAgent + level, not at the GUI level — making failures hard to diagnose +- The mock gemini_cli setup in tests does NOT redirect `ai_client.send()` in + the TEST process (only in the GUI process via `gcli_path`), so UserSimAgent + would attempt real API calls + +No test documents whether UserSimAgent is actually exercised in the extended +sims (`test_extended_sims.py`) or whether those sims use the ApiHookClient +directly to drive the GUI. + +**Severity: MEDIUM.** + +### 3.8 LOW: Gemini CLI Tool-Call Protocol Not Exercised + +The real Gemini CLI emits `{"type": "tool_use", "tool": {...}}` events mid-stream +and then waits for `{"type": "tool_result", ...}` piped back on stdin. The +`mock_gemini_cli.py` does not emit any `tool_use` events; it only detects +`'"role": "tool"'` in the prompt to simulate a post-tool-call turn. + +This means `GeminiCliAdapter`'s tool-call parsing logic (the branch that +handles `tool_use` event types and accumulates them) is NEVER exercised by +any test. A regression in that parsing branch would be invisible to the +test suite. + +**Severity: LOW** (only relevant when the real gemini CLI is used with tools). + +### 3.9 LOW: `reset_ai_client` Autouse Fixture Timing is Wrong for Async Tests + +The `reset_ai_client` autouse fixture runs synchronously before each test. +For tests marked `@pytest.mark.asyncio`, the reset happens BEFORE the test's +async setup. 
If the async test itself triggers ai_client operations in setup +(e.g., through an event loop created by the fixture), the reset may not +capture all state mutations. This is an edge case but could explain +intermittent behavior in async tests. + +**Severity: LOW.** + +--- + +## Section 4: Revised Severity Matrix + +| Severity | Finding | GLM? | Source | +|---|---|---|---| +| **HIGH** | Mock provider has zero failure modes — all integration tests pass unconditionally | Confirmed | GLM | +| **HIGH** | `app_instance` fixture mocks ImGui — rendering tests are existence checks only | Missed | Claude | +| **HIGH** | `live_gui` session scope — dirty state from one test bleeds into the next | Missed | Claude | +| **HIGH** | `ConfirmDialog.wait()` has no outer timeout — worker thread can hang indefinitely | Missed | Claude | +| **HIGH** | Shallow assertions — substring match and length check only, no schema validation | Confirmed | GLM | +| **HIGH** | No negative path coverage — rejection flows, timeouts, malformed inputs untested | Confirmed | GLM | +| **MEDIUM** | Auto-approval never asserts dialog appeared before approving | Corrected | GLM/Claude | +| **MEDIUM** | `mcp_client` module state not reset between unit tests | Missed | Claude | +| **MEDIUM** | `current_tier` global — no test demonstrates safe single-thread or failure under concurrent use | Missed | Claude | +| **MEDIUM** | Arch boundary tests validate TOML config, not runtime enforcement | Missed | Claude | +| **MEDIUM** | `UserSimAgent` calls `ai_client.send()` directly — implicit real API dependency | Missed | Claude | +| **MEDIUM** | Arbitrary 1-second poll intervals miss sub-second transient states | Confirmed | GLM | +| **MEDIUM** | Mock CLI bypasses real subprocess spawning path | Confirmed | GLM | +| **LOW** | GeminiCliAdapter tool-use parsing branch never exercised by any test | Missed | Claude | +| **LOW** | `reset_ai_client` autouse timing may be incorrect for async tests | Missed | Claude | +| 
**LOW** | Variable latency / human-like simulation | Confirmed | GLM | + +--- + +## Section 5: Prioritized Recommendations for Downstream Tracks + +Listed in execution order, not importance order. Each maps to an existing or +proposed track. + +### Rec 1: Extend mock_gemini_cli with Failure Modes +**Target track:** New — `mock_provider_hardening_20260305` +**Files:** `tests/mock_gemini_cli.py` +**What:** Add a `MOCK_MODE` environment variable selector: +- `success` (current behavior, default) +- `malformed_json` — emit a truncated/corrupt JSON-L line +- `error_result` — emit `{"type": "result", "status": "error", ...}` +- `timeout` — sleep 90s to trigger the CLI timeout path +- `tool_use` — emit a real `tool_use` event to exercise GeminiCliAdapter parsing + +Tests that need to verify error handling pass `MOCK_MODE=error_result` via +`client.set_value()` before triggering the AI call. + +### Rec 2: Add Dialog Assertion Before Auto-Approval +**Target track:** `test_suite_performance_and_flakiness_20260302` (already planned) +**Files:** All live_gui simulation tests, `tests/test_visual_sim_mma_v2.py` +**What:** Replace the conditional approval pattern: +```python +# BAD (current): +if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn') +# GOOD: +assert status.get('pending_mma_spawn_approval'), "Spawn dialog must appear before approve" +client.click('btn_approve_spawn') +``` +Also add at least one test per dialog type that clicks reject and asserts the +correct downstream state (engine marks track blocked, no worker spawned, etc.). 
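The rejection test Rec 2 calls for could take the following shape. `FakeHookClient` below is a stand-in used only to make the assertion pattern concrete — the real test would use `ApiHookClient` against the live GUI — and the `'blocked'` terminal status is an assumption about the engine's reject semantics, not a documented contract.

```python
class FakeHookClient:
    """Minimal stand-in for ApiHookClient; only illustrates the
    assertion shape. The real client talks HTTP to the live GUI."""

    def __init__(self) -> None:
        self._status = {
            'pending_mma_spawn_approval': True,
            'mma_status': 'awaiting_approval',
        }

    def get_mma_status(self) -> dict:
        return dict(self._status)

    def click(self, widget_id: str) -> None:
        if widget_id == 'btn_reject_spawn':
            # Assumed semantics: rejecting clears the dialog and
            # leaves the track in a 'blocked' state.
            self._status = {
                'pending_mma_spawn_approval': False,
                'mma_status': 'blocked',
            }

def reject_path_test(client) -> None:
    status = client.get_mma_status()
    assert status.get('pending_mma_spawn_approval'), "Spawn dialog must appear before reject"
    client.click('btn_reject_spawn')
    after = client.get_mma_status()
    assert not after.get('pending_mma_spawn_approval'), "Dialog must clear after reject"
    assert after.get('mma_status') == 'blocked', "Track must not proceed after reject"

reject_path_test(FakeHookClient())
```

Note the same assert-before-act discipline as the approval fix: the test fails loudly if the dialog never appeared, rather than silently skipping the click.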
+
+### Rec 3: Fix live_gui Session Scope Dirty State
+**Target track:** `test_suite_performance_and_flakiness_20260302`
+**Files:** `tests/conftest.py`
+**What:** Add a per-test autouse fixture (function-scoped) that asserts clean
+GUI state before each `live_gui` test. Guard it so it only activates for tests
+that already use `live_gui` — an unconditional autouse fixture requesting
+`live_gui` from the top-level conftest would force the session GUI to start
+for every unit test as well:
+```python
+@pytest.fixture(autouse=True)
+def assert_gui_clean(live_gui):
+    client = ApiHookClient()
+    status = client.get_mma_status()
+    assert not status.get('pending_mma_spawn_approval')
+    assert not status.get('pending_mma_step_approval')
+    assert not status.get('pending_tool_approval')
+    assert status.get('mma_status') in ('idle', 'done', '')
+```
+This surfaces inter-test pollution immediately rather than causing a
+mysterious hang in a later test.
+
+### Rec 4: Add ConfirmDialog Timeout Test
+**Target track:** New — `mock_provider_hardening_20260305` (or `test_stabilization`)
+**Files:** `tests/test_conductor_engine.py`
+**What:** Add a test that creates a `ConfirmDialog`, never signals it, and
+verifies after N seconds that the background thread does NOT block indefinitely.
+This requires either a hard timeout on `wait()` or a documented contract that
+callers must signal the dialog within a finite window.
+
+### Rec 5: Expose More State via Hook API
+**Target track:** `hook_api_ui_state_verification_20260302` (already planned, HIGH priority)
+**Files:** `src/api_hooks.py`
+**What:** This track is the key enabler for replacing `app_instance` rendering
+tests with genuine state verification. The planned `/api/gui/state` endpoint
+should expose:
+- Active modal type (`confirm_dialog`, `mma_step_approval`, `mma_spawn_approval`, `ask`, `none`)
+- `ui_focus_agent` current filter value
+- `_mma_status`, `_ai_status` text values
+- Panel visibility flags
+
+Once this is in place, the `app_instance` rendering tests can be migrated
+to `live_gui` equivalents that actually verify GUI-visible state.
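One possible response shape for the planned `/api/gui/state` endpoint, with the schema-level checks a `live_gui` test could run against it. Only the four bulleted fields come from Rec 5; the key names, the `panel_visibility` map, and the example values are assumptions for illustration.

```python
# Hypothetical /api/gui/state payload shape; field names beyond the
# four bullets in Rec 5 are illustrative assumptions.
ALLOWED_MODALS = {'confirm_dialog', 'mma_step_approval', 'mma_spawn_approval', 'ask', 'none'}

def validate_gui_state(payload: dict) -> dict:
    """Schema-level assertions a live_gui test could apply to the payload."""
    assert payload['active_modal'] in ALLOWED_MODALS
    assert isinstance(payload['ui_focus_agent'], str)
    assert isinstance(payload['mma_status'], str)
    assert isinstance(payload['ai_status'], str)
    assert all(isinstance(v, bool) for v in payload['panel_visibility'].values())
    return payload

example = validate_gui_state({
    'active_modal': 'none',
    'ui_focus_agent': '',
    'mma_status': 'idle',
    'ai_status': 'Ready',
    'panel_visibility': {'token_budget': True, 'discussion': True},
})
```

Validating against a fixed schema like this is what lifts the migrated tests above the substring-existence assertions criticized in Section 2.3.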
+ +### Rec 6: Add mcp_client Reset to autouse Fixture +**Target track:** `test_suite_performance_and_flakiness_20260302` +**Files:** `tests/conftest.py` +**What:** Extend `reset_ai_client` autouse fixture to also call +`mcp_client.configure([], [])` to clear the allowlist between tests. +This prevents allowlist state from a previous test from leaking into the next. + +### Rec 7: Add Runtime HITL Enforcement Test +**Target track:** `test_suite_performance_and_flakiness_20260302` or new +**Files:** `tests/test_arch_boundary_phase2.py` +**What:** Add an integration test (using `app_instance`) that: +1. Calls `ai_client.set_agent_tools({'set_file_slice': True})` +2. Confirms `mcp_client.MUTATING_TOOLS` contains `'set_file_slice'` +3. Triggers a dispatch of `set_file_slice` +4. Verifies `pre_tool_callback` was invoked BEFORE the write occurred + +This closes the gap between "config says mutating tools are off" and +"runtime actually gates them through the approval callback." + +### Rec 8: Document `app_instance` Limitation in conftest +**Target track:** Any ongoing work — immediate, no track needed +**Files:** `tests/conftest.py` +**What:** Add a docstring to `app_instance` fixture: +```python +""" +App instance with all ImGui rendering calls mocked to no-ops. +Use for unit tests of state logic and method existence. +DO NOT use to verify rendering correctness — use live_gui for that. +""" +``` +This prevents future workers from writing rendering tests against this fixture +and believing they have real coverage. + +--- + +## Section 6: What the Existing Track Queue Gets Right + +The `TASKS.md` strict execution queue is well-ordered for the test concerns: + +1. `test_stabilization_20260302` → Must be first: asyncio lifecycle, mock-rot ban +2. `strict_static_analysis_and_typing_20260302` → Type safety before refactoring +3. `codebase_migration_20260302` → Already complete (commit 270f5f7) +4. `gui_decoupling_controller_20260302` → Already complete (commit 1bc4205) +5. 
`hook_api_ui_state_verification_20260302` → Critical enabler for real rendering tests +6. `robust_json_parsing_tech_lead_20260302` → Valid, but NOTE: the mock never produces + malformed JSON, so the auto-retry loop cannot be verified without Rec 1 above +7. `concurrent_tier_source_tier_20260302` → Threading safety for future parallel workers +8. `test_suite_performance_and_flakiness_20260302` → Polling determinism, sleep elimination + +The `test_architecture_integrity_audit_20260304` (this track) sits logically +between #1 and #5 — it provides the analytical basis for what #5 and #8 need +to fix. The audit output (this document) should be read by the Tier 2 Tech Lead +for both those tracks. + +The proposed new tracks (mock_provider_hardening, negative_path_testing) from +GLM's recommendations are valid but should be created AFTER track #5 +(`hook_api_ui_state_verification`) is complete, since they depend on the +richer Hook API state to write meaningful assertions. + +--- + +## Section 7: Architectural Observations Not in GLM's Report + +### The Two-Tier Mock Problem + +The test suite has two completely separate mock layers that do not know about +each other: + +**Layer 1** — `app_instance` fixture (in-process): Patches `immapp.run()`, +`ai_client.send()`, and related functions with `unittest.mock`. Tests call +methods directly. No network, no subprocess, no real threading. + +**Layer 2** — `mock_gemini_cli.py` (out-of-process): A fake subprocess that +the live GUI process calls through its own internal LLM pipeline. Tests drive +this via `ApiHookClient` HTTP calls to the running GUI process. + +These layers test completely different things. Layer 1 tests Python object +invariants. Layer 2 tests the full application pipeline (threading, HTTP, IPC, +process management). Most of the test suite is Layer 1. Very few tests are +Layer 2. The high-value tests are Layer 2 because they exercise the actual +system, not a mock of it. 
+ +GLM correctly identifies that Layer 1 tests are of limited value for +rendering verification but does not frame it as a two-layer architecture +problem with a clear solution (expand Layer 2 via hook_api_ui_state_verification). + +### The Simulation Framework's Actual Role + +The `simulation/` module is not (and should not be) a fidelity benchmark. +Its role is: +1. Drive the GUI through a sequence of interactions +2. Verify the GUI reaches expected states after each interaction + +The simulations (`sim_context.py`, `sim_ai_settings.py`, `sim_tools.py`, +`sim_execution.py`) are extremely thin wrappers. Their actual test value +comes from `test_extended_sims.py` which calls them against a live GUI and +verifies no exceptions are thrown. This is essentially a smoke test for the +GUI lifecycle, not a behavioral verification. + +The real behavioral verification is in `test_visual_sim_mma_v2.py` and +similar files that assert specific state transitions. The simulation/ +module should be understood as "workflow drivers," not "verification modules." + +GLM's recommendation to add latency simulation and human-like behavior to +`simulation/user_agent.py` would add complexity to a layer that isn't the +bottleneck. The bottleneck is assertion depth in the polling loops, not +realism of the user actions. + +--- + +*End of report. 
Next action: Tier 2 Tech Lead to read this alongside
+`plan.md` and initiate track #5 (`hook_api_ui_state_verification_20260302`)
+as the highest-leverage unblocking action.*
diff --git a/conductor/tracks/test_architecture_integrity_audit_20260304/spec.md b/conductor/tracks/test_architecture_integrity_audit_20260304/spec.md
new file mode 100644
index 0000000..1a43afe
--- /dev/null
+++ b/conductor/tracks/test_architecture_integrity_audit_20260304/spec.md
@@ -0,0 +1,96 @@
+# Track Specification: Test Architecture Integrity & Simulation Audit
+
+## Overview
+Comprehensive audit of testing infrastructure and simulation framework to identify false positive risks, coverage gaps, and simulation fidelity issues. This analysis was triggered by a request to review how tests and simulations are set up, whether tests can report passing grades when they fail, and if simulations are rigorous enough or are just rough emulators.
+
+## Current State Audit (as of 20260304)
+
+### Already Implemented (DO NOT re-implement)
+- **Testing Infrastructure** (tests/conftest.py):
+  - live_gui fixture for session-scoped GUI lifecycle management
+  - Process cleanup with kill_process_tree()
+  - VerificationLogger for diagnostic logging
+  - Artifact isolation to tests/artifacts/ and tests/logs/
+  - Ban on arbitrary core mocking
+
+- **Simulation Framework** (simulation/):
+  - sim_base.py: Base simulation class with setup/teardown
+  - workflow_sim.py: Workflow orchestration
+  - sim_context.py, sim_ai_settings.py, sim_tools.py, sim_execution.py
+  - user_agent.py: Simulated human agent
+
+- **Mock Provider** (tests/mock_gemini_cli.py):
+  - Keyword-based response routing
+  - JSON-L protocol matching real CLI output
+
+#### Critical False
Positive Risks Identified +1. **Mock Provider Always Returns Success**: Never validates input, never produces errors, never tests failure paths +2. **Auto-Approval Pattern**: All HITL gates auto-clicked, never verifying dialogs appear or rejection flows +3. **Substring-Based Assertions**: Only check existence of content, not validity or structure +4. **State Existence Only**: Tests check fields exist but not their correctness or invariants +5. **No Negative Path Testing**: No coverage for rejection, timeout, malformed input, concurrent access +6. **No Visual Verification**: Tests verify logical state via Hook API but never check what's actually rendered +7. **No State Machine Validation**: No verification that status transitions are legal or complete + +#### Simulation Rigor Gaps Identified +1. **No Real-Time Latency Simulation**: Fixed delays don't model variable LLM/network latency +2. **No Human-Like Behavior**: Instant actions, no typing speed, hesitation, mistakes, or task switching +3. **Arbitrary Polling Intervals**: 1-second polls may miss transient states +4. **Mock CLI Redirection**: Bypasses subprocess spawning, environment passing, and process cleanup paths +5. **No Stress Testing**: No load testing, no edge case bombardment + +#### Test Coverage Gaps +- No tests for approval dialog rejection flows +- No tests for malformed LLM response handling +- No tests for network timeout/failure scenarios +- No tests for concurrent duplicate requests +- No tests for out-of-order event sequences +- No thread-safety tests for shared resources +- No visual rendering verification (modal visibility, text overflow, color schemes) + +#### Structural Testing Contract Gaps +- Missing rule requiring negative path testing +- Missing rule requiring state validation beyond existence +- Missing rule requiring visual verification +- No enforcement for thread-safety testing + +## Goals + +1. Document all identified testing pitfalls with severity ratings (HIGH/MEDIUM/LOW) +2. 
Create actionable recommendations for each identified issue +3. Map existing test coverage gaps to specific missing test files +4. Provide architecture recommendations for simulation framework enhancements + +## Functional Requirements + +- [ ] Document all false positive risks in a structured format +- [ ] Document all simulation fidelity gaps in a structured format +- [ ] Create severity matrix for each issue +- [ ] Generate list of missing test cases by category +- [ ] Provide concrete examples of how current tests would pass despite bugs +- [ ] Provide concrete examples of how simulations would miss UX issues + +## Non-Functional Requirements + +- Report must include author attribution (GLM-4.7) and derivation methodology +- Analysis must cite specific file paths and line numbers where applicable +- Recommendations must be prioritized by impact and implementation effort + +## Architecture Reference + +Refer to: +- docs/guide_simulations.md - Current simulation contract and patterns +- docs/guide_mma.md - MMA orchestration architecture +- docs/guide_architecture.md - Thread domains, event system, HITL mechanism +- conductor/tracks/*/spec.md - Existing track specifications for consistency + +## Out of Scope + +- Implementing the actual test fixes (that's for subsequent tracks) +- Refactoring the simulation framework (documenting only) +- Modifying the mock provider (analyzing only) +- Writing new tests (planning phase for future tracks)