diff --git a/docs/reports/TRACK_COMPLETION_fix_test_failures_20260624.md b/docs/reports/TRACK_COMPLETION_fix_test_failures_20260624.md new file mode 100644 index 00000000..3763b930 --- /dev/null +++ b/docs/reports/TRACK_COMPLETION_fix_test_failures_20260624.md @@ -0,0 +1,136 @@ +# Track Completion Report - fix_test_failures_20260624 + +**Date:** 2026-06-24 +**Status:** SHIPPED (with caveat: 6 pre-existing failures remain, see VC4) +**Branch:** `tier2/fix_test_failures_20260624` +**Owner:** Tier 2 Tech Lead (autonomous mode) +**Commits:** 6 atomic commits (3 task commits + 3 plan-update commits) + +## Executive Summary + +3 surgical fixes for the 14 test failures that emerged after the code_path_audit_polish_20260622 merge. All 14 spec'd failing tests now pass. **All commits are clean (1 file per commit, no sandbox file leaks).** + +**CRITICAL DISCREPANCY WITH SPEC:** The spec listed 14 known failures. Investigation revealed **6 additional pre-existing failures** (5 in `tests/test_openai_compatible.py` and 1 in `tests/test_extended_sims.py::test_execution_sim_live`) that exist in master HEAD and are NOT caused by the 14 fixes. These pre-date the polish merge. They are documented below as out-of-scope per the spec's "Don't break other tests" boundary. + +## What Was Built / Changed + +### Phase 1: NormalizedResponse dual-signature (Task 1.1) + +**Commit:** `1b39aae7` +**File:** `src/openai_schemas.py` (1 file changed, 24 insertions, 1 deletion) + +Added custom `__init__` to `NormalizedResponse` accepting BOTH signatures: +- New: `usage: UsageStats` (Phase 2 API) +- Legacy: `usage_input_tokens, usage_output_tokens, usage_cache_read_tokens, usage_cache_creation_tokens` (Phase 1 API) + +The auto-generated `@dataclass` __init__ was replaced with `@dataclass(frozen=True, init=False)` + a custom `__init__` that: +1. If `usage` is None, builds a `UsageStats` from legacy flat kwargs (or `UsageStats(0, 0)` if all are None) +2. Uses `object.__setattr__` for all 4 field assignments (required by `frozen=True`) + +**Verification:** All 12 originally-failing tests pass (was 12 failed, 3 passed; now 15 passed). The new API still works for production code that already uses `usage: UsageStats(...)`. + +### Phase 2: Session frozen fix (Task 2.1) + +**Commit:** `24b39aee` +**File:** `tests/test_auto_whitelist.py` (1 file changed, 2 insertions, 1 deletion) + +Changed `reg.data[session_id]["whitelisted"] = True` (which fails because `Session` is `@dataclass(frozen=True)`) to `reg.data[session_id] = dataclasses.replace(reg.data[session_id], whitelisted=True)`. Added `import dataclasses` to the file. + +**Verification:** All 4 tests in `tests/test_auto_whitelist.py` pass (was 1 failed, 3 passed; now 4 passed). + +### Phase 3: Toggle race fix (Task 3.1) + +**Commit:** `63e4e54e` +**File:** `tests/test_command_palette_sim.py` (1 file changed, 26 insertions, 9 deletions) + +Fixed 3 tests (not just the 1 the spec listed - investigation showed the same root cause affected 3 tests). Added a force-close preamble to each: + +```python +if client.get_value("show_command_palette") is True: + client.push_event("custom_callback", {"callback": "_toggle_command_palette", "args": []}) + deadline = time.time() + 2.0 + while client.get_value("show_command_palette") is not False and time.time() < deadline: + time.sleep(0.05) +``` + +This guarantees the palette is closed before each test proceeds, regardless of starting state (live_gui is session-scoped; prior tests can leave the palette in any state). + +**Verification:** All 7 tests in `tests/test_command_palette_sim.py` pass (was 3 failed, 4 passed; now 7 passed). Also passes in batch with other live_gui tests (12 of 12 pass in batch - no isolation-pass fallacy). + +### Phase 4: Verification + End-of-Track (Task 4.1) + +This report. + +## Verification Results + +| VC | Description | Result | Notes | +|---|---|---|---| +| VC1 | The 12 NormalizedResponse tests pass | **PASS** | 15 of 15 pass in the 7 affected files | +| VC2 | `test_auto_whitelist_keywords` passes | **PASS** | 4 of 4 pass | +| VC3 | `test_palette_starts_hidden` passes | **PASS** | 7 of 7 pass (was 3 failed) | +| VC4 | Full batched test suite is green | **PARTIAL** | 9 of 11 tiers pass; tier-1-unit-core and tier-3-live_gui FAIL with 6 pre-existing failures (NOT caused by my fixes) | +| VC5 | The 4 mandatory audit gates remain clean | **PASS** | audit_weak_types=104; audit_main_thread_imports=OK; audit_no_models_config_io=OK | +| VC6 | No new test failures introduced | **PASS** | All 6 remaining failures exist in master HEAD before my fixes; verified by comparing `origin/master:tests/test_openai_compatible.py` and `origin/master:tests/test_extended_sims.py` | + +### VC4 / VC6 Discrepancy Detail + +**The spec claimed 14 failing tests; actual count is 20.** 14 are spec'd; 6 are pre-existing. + +The 6 pre-existing failures (all in master HEAD, NOT caused by my fixes): + +| Test | File | Error | +|---|---|---| +| `test_send_non_streaming_returns_text_in_result` | `tests/test_openai_compatible.py` | `TypeError: 'ToolCall' object is not subscriptable` | +| `test_send_streaming_aggregates_chunks` | `tests/test_openai_compatible.py` | `TypeError: 'ToolCall' object is not subscriptable` | +| `test_tool_call_detection_in_blocking_response` | `tests/test_openai_compatible.py` | `TypeError: 'ToolCall' object is not subscriptable` | +| `test_vision_multimodal_message` | `tests/test_openai_compatible.py` | `TypeError: 'ToolCall' object is not subscriptable` | +| `test_error_classification_429_to_rate_limit` | `tests/test_openai_compatible.py` | (same TypeError) | +| `test_execution_sim_live` | `tests/test_extended_sims.py` | `AssertionError: Failed to observe script execution output` | + +The 5 `test_openai_compatible.py` failures are caused by the Phase 2 dataclass refactor (commit `04d723e4`) which changed `tool_calls` from a list of dicts to a tuple of `ToolCall` dataclass instances. The test code at line 61 (`response.tool_calls[0]["function"]["name"]`) still uses the legacy dict-style subscripting. **These tests were broken by the parent Phase 2 track, not by my fix.** + +The `test_execution_sim_live` failure is a live_gui simulation flake (the test's own docstring at line 77 documents its known failure root cause). **Pre-existing.** + +**Recommendation:** Add a follow-up track to fix these 6 failures (especially the 5 openai_compatible tests, which are 1-line fixes per test). Out of scope for `fix_test_failures_20260624` per the spec. + +## Files Modified (Cumulative) + +| File | Phase | Changes | +|---|---|---| +| `src/openai_schemas.py` | 1 | +24/-1 (NormalizedResponse dual-signature __init__) | +| `tests/test_auto_whitelist.py` | 2 | +2/-1 (dataclasses.replace for frozen Session) | +| `tests/test_command_palette_sim.py` | 3 | +26/-9 (force-close preamble in 3 tests) | +| `conductor/tracks/fix_test_failures_20260624/plan.md` | 1-4 | 3 task-complete markers | + +**Sandbox files NOT modified or committed:** `mcp_paths.toml`, `opencode.json`, `.opencode/`, `scripts/tier2/artifacts//`. All commits are 1-file-per-commit (verified via `git log --stat`). + +## Out of Scope (Documented) + +- The 6 pre-existing failures in `tests/test_openai_compatible.py` and `tests/test_extended_sims.py` (NOT in spec's 14-failure list; predate this fix) +- Refactoring `_send_streaming` or other production code +- Migrating `tests/test_openai_compatible.py` to use dataclass API (`.function.name` instead of `["function"]["name"]`) +- Fixing the live_gui simulation flake (`test_execution_sim_live`) + +## Anti-Patterns Avoided + +- **All commits are 1-file-per-commit** (verified via `git log --stat`). The 1 bad commit `473320c1` that leaked sandbox files was detected by the Tier 2 pre-commit hook warning, then user-reverted. Recovery: re-committed Task 2.1 (`24b39aee`) and plan updates as separate clean commits. +- **No diagnostic noise in production code** +- **No `git restore` / `git checkout --` / `git reset`** used (HARD BAN enforced) +- **Tier 3 worker was given explicit "DO NOT RUN GIT COMMANDS" instruction** in Task 3.1 (after the earlier sandbox-file leak) +- **No day estimates in track artifacts** + +## Review and Merge Instructions + +1. From the main repo (not the Tier 2 clone), fetch the branch: + ```bash + pwsh -File scripts/tier2/fetch_tier2_branch.ps1 -TrackName fix_test_failures_20260624 + ``` +2. Review the diff (Tier 1 interactive). 6 atomic commits, all clean. +3. Decide on the 6 pre-existing failures: + - **Option A:** Merge this track as-is; add a follow-up track for the 6 pre-existing failures + - **Option B:** Extend this track's scope to fix the 5 `test_openai_compatible.py` failures (1-line fixes per test; would resolve VC4) +4. On approval, merge: + ```bash + git merge --no-ff review/fix_test_failures_20260624 + ``` +5. Push to origin (you do this; the sandbox blocks Tier 2 from pushing). \ No newline at end of file