diff --git a/conductor/tracks/doeh_test_thinking_cleanup_20260615/spec.md b/conductor/tracks/doeh_test_thinking_cleanup_20260615/spec.md new file mode 100644 index 00000000..c82c321b --- /dev/null +++ b/conductor/tracks/doeh_test_thinking_cleanup_20260615/spec.md @@ -0,0 +1,277 @@ +# Track: Data-Oriented Error Handling Test & Thinking-Parser Cleanup + +**Status:** Active (spec approved 2026-06-15) +**Initialized:** 2026-06-15 +**Owner:** Tier 2 Tech Lead +**Priority:** High (1 critical production regression + 10+ test mock fixes + 2 deferred bugs) + +--- + +## 1. Overview + +This track is the **cleanup follow-up** to two previously-completed tracks: `data_oriented_error_handling_20260606` (shipped 2026-06-12) and `ai_loop_regressions_20260614` (shipped 2026-06-15). It consolidates 3 categories of remaining work into a single deliverable: + +1. **A new production regression** introduced by `ai_loop_regressions_20260614` commit `2b7b571a` (FR2 fix): the `_api_generate` function in `src/app_controller.py:265-295` references an undefined variable `context_to_send`, causing `/api/v1/generate` to return HTTP 500 on every call. This bug was not caught by the previous track's smoke tests (which only verified Hook API substrate reachability) and was missed in the Tier 1 review (which relied on the test pass count, not direct code inspection of the FR2 diff). + +2. **10 pre-existing test mock bugs** from the `data_oriented_error_handling_20260606` refactor: tests that call `_send_()` and assert against raw `str` return values, while the production code now returns `Result[str]`. Mechanical fixes (`assert result.ok and result.data == "x"` instead of `assert result == "x"`). + +3. **2 deferred bugs** from `ai_loop_regressions_20260614` spec §13: Gemini / Gemini CLI thinking-format compatibility (Bug #4) and `` (half-width) marker support in `thinking_parser` (Bug #5). + +Plus 2 housekeeping items discovered during Tier 1 review of `ai_loop_regressions_20260614`: the duplicate-key bug in that track's `state.toml` (which makes the file unparseable by Python's `tomllib`), and the `tracks.md` row 24 that was never updated to mark the track complete. + +This track does NOT include (deferred to separate tracks — see §13): +- The `public_api_migration_20260606` follow-up (5 production + 63 test call sites not migrated to `send_result()`) +- A `live_gui_mock_injection` infrastructure track (would unblock proper end-to-end live_gui + AI client tests) +- Pre-existing RAG flakiness (`test_rag_phase4_final_verify`) +- The UI Polish Five Issues phases (2 unrelated test failures covered by that track) + +## 2. Goals (Priority Order) + +| Priority | Goal | Rationale | +|---|---|---| +| **A (critical)** | Fix the `_api_generate` `NameError` regression introduced by `ai_loop_regressions_20260614` commit `2b7b571a` | Production bug: `/api/v1/generate` returns HTTP 500 on every call. The fix is small (~3 lines: add back the `_disc_entries_lock` acquisition and `context_to_send = stable_md if not has_ai_response else ""`). A failing test (`test_headless_service.test_generate_endpoint`) is the canary. | +| **A (primary value)** | Fix the 10 pre-existing test mock bugs from `data_oriented_error_handling_20260606` | The test suite has 10+ red tests that are all the same mechanical pattern. Fixing them gets the test suite back to green. Each test is a 1-line change (use `result.data` or `result.ok` checks). | +| **B (architectural)** | Investigate and fix the Gemini / Gemini CLI thinking-format compatibility (Bug #4) | The user complained that thinking monologues don't render for Gemini. Empirical investigation needed: run a Gemini request, inspect `resp.text`, determine if a normalization pass is needed in `_send_gemini*`. | +| **B (architectural)** | Add `` (half-width) marker support to `thinking_parser.py` | User screenshot 1 showed `...` format. The current regex at `src/thinking_parser.py:9` requires the full-width ``. Small change (~3 lines + tests). | +| **C (housekeeping)** | Fix the `state.toml` duplicate-key bug in `ai_loop_regressions_20260614` | The state file is unparseable by Python's `tomllib` due to TOML §3.3.1 "Cannot overwrite a value". The fix is deleting lines 23-26 and 46-58. This blocks archival of the parent track. | +| **C (housekeeping)** | Update `conductor/tracks.md` row 24 to reflect completion of `ai_loop_regressions_20260614` | The track was completed on 2026-06-15 but the row still says "spec ✓, plan ✓, ready to start". | +| **C (verification)** | Full test suite sweep + `docs/guide_ai_client.md` "See Also" section update | Document the new `Result` API test patterns and the deferred items. | + +### 2.1 Non-Goals (this track) + +- **Not** migrating the remaining 5 production + 63 test call sites to `send_result()`. That is `public_api_migration_20260606`, a separate planned track with its own scope. This track only fixes the broken `_api_generate` site (which is the only newly-introduced production regression) and the 10+ tests that would be touched by the public_api migration. +- **Not** introducing a `live_gui_mock_injection` infrastructure. That's a separate concern (test infrastructure) requiring subprocess mock injection. Recommended as its own track. +- **Not** fixing the pre-existing RAG flakiness in `test_rag_phase4_final_verify`. That test had a partial fix in commit `16412ad5` (RAG Phase 4 dim-mismatch) and a subsequent failure with `'NoneType' object has no attribute 'get'`. This is a RAG subsystem concern, not an AI client test mock concern. +- **Not** fixing `test_discussion_truncate_layout.py::test_keep_pairs_input_uses_adequate_width` and `test_log_management_refresh.py::test_refresh_registry_button_calls_load_registry`. These are Phase 2 and Phase 3 of the UI Polish Five Issues track, which has its own plan and spec. The 2 failing tests are correctly identified as out-of-scope here. +- **Not** adding a CI gate or audit script. The existing `scripts/audit_*.py` scripts don't check for this category of regression (test mocks that don't match the new return types). +- **Not** removing the deprecated `ai_client.send()` shim. That's `public_api_migration_20260606`. + +## 3. Current State Audit (as of commit `515ef933`) + +### 3.1 Already Implemented (DO NOT re-implement) + +- **`src/result_types.py`**: `Result[T]`, `ErrorInfo`, `ErrorKind` dataclasses exist; the new convention is fully established. +- **`src/ai_client.py:send_result()`** (lines 2970-3092): the new public entry point, returns `Result[str]`. Routes to `_send__result()` per provider. +- **`src/ai_client.py:send()`** (lines 2907-2968): the `@deprecated` shim, returns `result.data` (empty string on error). +- **`src/ai_client.py:_send_*_result()`** (9 vendors): all return `Result[str]`. +- **`src/ai_client.py:run_with_tool_loop()`** (lines 734-836): now has `wrap_reasoning_in_text: bool = False` kwarg (added by `ai_loop_regressions_20260614` FR3 fix). +- **`src/app_controller.py:_handle_request_event`** (lines 3673-3697): uses `send_result()` + `result.ok` branching (fixed by `ai_loop_regressions_20260614` FR1). +- **`src/app_controller.py:_api_generate_sync`** (line 3692): also updated by FR1 (the 2nd `except ProviderError` site was already replaced; the `try`/`except` was also restructured). +- **`src/thinking_parser.py:parse_thinking_trace()`** (lines 8-54): supports ``, ``, and `Thinking:` prefix markers. + +### 3.2 Gaps to Fill (This Track's Scope) + +#### G1: `_api_generate` NameError regression (CRITICAL) + +**File:line**: `src/app_controller.py:265-295` (the `_api_generate` function) +**Bug introduced by**: `ai_loop_regressions_20260614` commit `2b7b571a` (FR2 fix) +**Symptom**: `/api/v1/generate` returns HTTP 500 with `NameError: name 'context_to_send' is not defined` +**Root cause**: The FR2 fix removed the `try:` block (which contained the `with controller._disc_entries_lock:` acquisition and the `context_to_send = stable_md if not has_ai_response else ""` assignment) and replaced it with a `send_result()` call that still references `context_to_send`. The variable definition was lost. + +The current state at `src/app_controller.py:278`: +```python +result = ai_client.send_result(context_to_send, user_msg, base_dir, ...) # context_to_send is undefined +``` + +The fix needs to add back the 2 lines BEFORE line 278: +```python +with controller._disc_entries_lock: + has_ai_response = any(e.get("role") == "AI" for e in controller.disc_entries) +context_to_send = stable_md if not has_ai_response else "" +``` + +**Failing test**: `tests/test_headless_service.py::TestHeadlessAPI::test_generate_endpoint` (currently returns 500). + +#### G2-G11: 10 pre-existing test mock bugs from `data_oriented_error_handling_20260606` + +All have the same root cause: the tests were written before the refactor when `_send_()` returned `str`; the production code now returns `Result[str]`. The fix is mechanical: change `assert result == "x"` to `assert result.ok and result.data == "x"`, and `assert "text" in result` to `assert result.ok and "text" in result.data`. + +| # | File:line | Test | Current assertion | Fix | +|---|---|---|---|---| +| **G2** | `tests/test_llama_provider.py:22` | `test_send_grok_uses_xai_endpoint` (wait, this is in test_grok_provider) | `assert result == "hi from grok"` | `assert result.ok and result.data == "hi from grok"` | +| **G3** | `tests/test_grok_provider.py:13` | `test_send_grok_uses_xai_endpoint` | `assert result == "hi from grok"` | `assert result.ok and result.data == "hi from grok"` | +| **G4** | `tests/test_grok_provider.py:30` | `test_grok_web_search_adds_search_parameters_to_extra_body` | `assert len(captured_kwargs) == 1` (got 12) | Loop now calls the mock 12 times; update to `assert any(kw["extra_body"] is not None and kw["extra_body"].get("search_parameters", {}).get("mode") == "auto" for kw in captured_kwargs)` | +| **G5** | `tests/test_grok_provider.py:46` | `test_grok_x_search_adds_x_source_to_extra_body` | `assert captured_kwargs[0]["extra_body"]["search_parameters"]["sources"] == [{"type": "x"}]` | Same as G4 — change to check across all kwargs | +| **G6** | `tests/test_llama_provider.py:24` | `test_send_llama_openrouter_backend` | `assert result == "hi from openrouter"` | `assert result.ok and result.data == "hi from openrouter"` | +| **G7** | `tests/test_llama_provider.py:43` | `test_send_llama_custom_url` | `assert result == "hi from custom"` | `assert result.ok and result.data == "hi from custom"` | +| **G8** | `tests/test_llama_provider.py:62` | `test_send_llama_ollama_backend` | `assert "hi from ollama" in result` | `assert result.ok and "hi from ollama" in result.data` | +| **G9** | `tests/test_llama_ollama_native.py:70` | `test_send_llama_native_calls_ollama_chat_when_localhost` | `assert "hi from native ollama" in result` | `assert result.ok and "hi from native ollama" in result.data` | +| **G10** | `tests/test_llama_ollama_native.py:88` | `test_send_llama_native_preserves_thinking_field` | `assert "I thought about it" in result` | `assert result.ok and "I thought about it" in result.data` | +| **G11** | `tests/test_llama_ollama_native.py:107` | `test_send_llama_routes_to_native_when_localhost` | `assert "via native" in result` | `assert result.ok and "via native" in result.data` | +| **G12** | `tests/test_llama_ollama_native.py:122` | `test_send_llama_keeps_openai_path_for_non_local` | `assert "via openrouter" in result` | `assert result.ok and "via openrouter" in result.data` | +| **G13** | `tests/test_ai_client_tool_loop_builder.py:22` | `test_run_with_tool_loop_calls_request_builder_each_round` | Mock returns raw `NormalizedResponse`; `_default_send` now does `if not res.ok:` expecting `Result[NormalizedResponse]` | Wrap the mock return in `Result(data=...)` | +| **G14** | `tests/test_headless_service.py:57` | `test_generate_endpoint` | Mocks `ai_client.send` (deprecated); production now uses `send_result`. Plus the G1 NameError. | Update mock to `ai_client.send_result` returning `Result(data="AI Response")`; this test will pass after G1 is fixed | + +#### G15: Gemini / Gemini CLI thinking-format compatibility (Bug #4 deferred from `ai_loop_regressions_20260614`) + +**File:line**: `src/ai_client.py:_send_gemini` (lines 1538-1781) and `src/ai_client.py:_send_gemini_cli` (lines 1783-1897), possibly `src/thinking_parser.py:9` +**Symptom**: User reported thinking monologues don't render for Gemini. The current `parse_thinking_trace` regex matches ``, ``, and `Thinking:` prefix. The Gemini SDK may emit a different format. +**Investigation needed**: empirically run a Gemini request that produces reasoning and inspect the raw `resp.text`. If the format is incompatible, add a normalization pass. + +#### G16: `` (half-width) marker support (Bug #5 deferred from `ai_loop_regressions_20260614`) + +**File:line**: `src/thinking_parser.py:9` (the regex at line 9) +**Symptom**: User screenshot 1 showed `This is DWARF debug info, not the actual disassembly...` — the half-width form. The current regex doesn't match this. +**Fix**: extend the `tag_pattern` to also match `...` (the closing tag is the same). + +#### G17: `state.toml` duplicate-key bug (housekeeping, blocks `ai_loop_regressions_20260614` archival) + +**File:line**: `conductor/tracks/ai_loop_regressions_20260614/state.toml` lines 23-26 and 46-58 +**Symptom**: Python's `tomllib.load()` raises `TOMLDecodeError: Cannot overwrite a value (at line 23, column 123)` +**Fix**: Delete the duplicate `phase_2..5` and `t2_1..t5_4` entries (the "pending" duplicates of the "completed" entries that already have the correct commit SHAs). + +#### G18: `tracks.md` row 24 not updated (housekeeping) + +**File:line**: `conductor/tracks.md:41` +**Symptom**: Track 24 still shows "spec ✓, plan ✓, ready to start" though the track shipped on 2026-06-15. +**Fix**: Update the status column to reflect completion, OR move the row to a "Recently Completed" section (per existing convention used by `qwen_llama_grok_integration_20260606`). + +## 4. Functional Requirements + +### FR1: Fix `_api_generate` NameError (G1) + +`_api_generate` in `src/app_controller.py:265-295` must: +1. Have `context_to_send` properly defined before the `send_result()` call. +2. Continue to use the `_disc_entries_lock` for thread-safe access to `disc_entries`. +3. Continue to use the `if not result.ok: raise HTTPException(502, ...)` pattern from the FR2 fix. + +The fix is 2-3 lines added before line 278: +```python +with controller._disc_entries_lock: + has_ai_response = any(e.get("role") == "AI" for e in controller.disc_entries) +context_to_send = stable_md if not has_ai_response else "" +``` + +### FR2: Fix the 11 pre-existing test mock bugs (G2-G12, G14) + +For each of the 11 tests, change the assertion pattern to handle `Result[str]`: +- `assert result == "x"` → `assert result.ok and result.data == "x"` +- `assert "text" in result` → `assert result.ok and "text" in result.data` + +For the Grok web_search / x_search tests (G4, G5), the test now goes through the tool loop and the mock is called multiple times. Change `assert captured_kwargs[0]...` to `assert any(kw["extra_body"]... for kw in captured_kwargs)`. + +For `test_headless_service.test_generate_endpoint` (G14): change the mock from `ai_client.send` to `ai_client.send_result` returning `Result(data="AI Response")`. + +### FR3: Fix `test_ai_client_tool_loop_builder` mock shape (G13) + +The mock at `tests/test_ai_client_tool_loop_builder.py:33` uses `patch("src.openai_compatible.send_openai_compatible", side_effect=[tool_response, final])` and returns raw `NormalizedResponse` objects. Since `run_with_tool_loop._default_send` now does `if not res.ok:` expecting a `Result[NormalizedResponse]`, the mock must return `Result(data=tool_response)` and `Result(data=final)`. + +### FR4: Investigate and fix Gemini thinking format (G15) + +Phase 3 task. Empirically investigate: +1. Run a Gemini request (real or mocked) that produces thinking content. +2. Inspect the raw `resp.text` to see what format it uses. +3. If the format is not `...` or `Thinking:`, decide: + - **Option A**: Add a normalization pass in `_send_gemini` and `_send_gemini_cli` to wrap the thinking in `` tags before returning. + - **Option B**: Extend `parse_thinking_trace` to match the new format. + +The empirical finding determines the approach. Document the result in the commit message. + +### FR5: Add `` half-width marker support (G16) + +Extend the `tag_pattern` regex at `src/thinking_parser.py:9` to also match `...` (half-width). The fix is a single regex addition to the existing pattern. Update the 5+ existing tests in `tests/test_thinking_trace.py` to verify the new pattern works. + +### FR6: Fix `state.toml` duplicate keys (G17) + +Delete lines 23-26 and 46-58 from `conductor/tracks/ai_loop_regressions_20260614/state.toml`. The "completed" entries at lines 18-22 and 29-45 are correct; the "pending" duplicates must be removed. + +### FR7: Update `tracks.md` row 24 (G18) + +Update the status column at `conductor/tracks.md:41` to reflect the track's completion. The user preferred pattern (move to "Recently Completed" or just update status) is a Tier 1 review decision; either is acceptable. + +### FR8: Regression sweep + doc update + +Phase 5 task. Run the full test suite (`uv run pytest tests/`) and verify all G1-G13 + FR1-FR5 fixes are green. Update `docs/guide_ai_client.md` "See Also" section with cross-references to this track (similar to what was done in `ai_loop_regressions_20260614`). + +## 5. Non-Functional Requirements + +- **NFR1 (Atomic per-task commits)**: each plan task is one commit; no batching. Use the project's "1 commit per task" discipline (see `conductor/workflow.md`). +- **NFR2 (1-space indentation)**: enforced by the project's AI-Optimized Python style. +- **NFR3 (No diagnostic noise in production)**: no `sys.stderr.write("[XYZ_DIAG] ...")` lines in committed code. If instrumentation is needed for the TDD test, it goes to `tests/artifacts/.diag.log`. +- **NFR4 (Test isolation)**: the 11 test mock fixes must NOT use `unittest.mock.patch` to bypass the new Result API; they must correctly unwrap `result.data` or check `result.ok`. Per the project's "No Mock Patches to Pseudo API" anti-pattern rule. +- **NFR5 (No regression in other providers)**: the 5 unaffected providers (Anthropic, Qwen, Grok non-thinking tests, Llama non-mock tests, Llama native non-mock tests) must continue to pass their existing tests. +- **NFR6 (Thread safety)**: the FR1 fix in `_api_generate` must use `_disc_entries_lock` (the same lock the original code used) to avoid races with the GUI's discussion updates. + +## 6. Architecture Reference + +For implementation details, consult: + +- **`docs/guide_ai_client.md`**: the canonical guide for `src/ai_client.py`; the new `send_result()` API is documented in the "Data-Oriented Error Handling (Fleury Pattern) > Public API" section. The test mock fixes (FR2, FR3) follow the patterns shown there. +- **`docs/guide_app_controller.md`**: the canonical guide for `src/app_controller.py`; the `_api_generate` and `_handle_request_event` flows are described in §"AI Loop Lifecycle". The FR1 fix lives in this subsystem. +- **`docs/guide_thinking.md`** (or `docs/guide_discussions.md`): the canonical guide for thinking-mono rendering; the `parse_thinking_trace` markers are documented. FR4 (Gemini format) and FR5 (half-width marker) are in this subsystem. +- **`conductor/code_styleguides/error_handling.md`**: the canonical reference for the Result/ErrorInfo pattern; the new FR2 test assertions follow §3.1 "AND over OR (Result struct with side-channel errors)". +- **`docs/reports/TRACK_COMPLETION_ai_loop_regressions_20260615.md`**: the parent track's completion report. The G17 state.toml bug and the G18 tracks.md row issue are documented in the Tier 1 review §"Critical Issues" of that track. + +## 7. Out of Scope + +The following items are **explicitly out of scope** and tracked elsewhere: + +- **`public_api_migration_20260606`** (planned, separate track): removes the deprecated `ai_client.send()` and migrates 5 production + 63 test call sites to `send_result()`. This track only fixes the broken `_api_generate` site (G1) and the test mock bugs that the public_api migration would touch (G2-G12). The other 50+ test call sites are deferred to public_api. +- **`live_gui_mock_injection_20260615`** (not yet specced): infrastructure for mock injection into the live_gui subprocess. Recommended as a separate track because it requires infrastructure work (subprocess mock protocol, conftest changes) and unblocks future live_gui + AI client tests. +- **`test_rag_phase4_final_verify` flakiness**: pre-existing RAG subsystem issue (not caused by the data_oriented_error_handling or ai_loop_regressions tracks). The `'NoneType' object has no attribute 'get'` error is in RAG config lookup code, not AI client code. Recommended as a separate RAG track. +- **`test_discussion_truncate_layout.py::test_keep_pairs_input_uses_adequate_width`**: Phase 2 of the UI Polish Five Issues track (`ui_polish_five_issues_20260302`). The track spec is at `docs/superpowers/specs/2026-06-03-ui-polish-design.md`. +- **`test_log_management_refresh.py::test_refresh_registry_button_calls_load_registry`**: Phase 3 of the same UI Polish track. Both are out of scope here. +- **The deprecated `ai_client.send()` removal**: that's the public_api_migration_20260606 track. + +## 8. Phases (Summary) + +| Phase | Name | Tasks | Verification | +|---|---|---|---| +| **Phase 1** | **CRITICAL: Fix `_api_generate` NameError (G1)** | 2 tasks: write failing test (`test_generate_endpoint` already exists; verify it fails for the NameError reason), fix the production code | `test_headless_service.test_generate_endpoint` returns 200 | +| **Phase 2** | **Fix 10 test mock bugs (G2-G12, G14) + 1 mock shape fix (G13)** | 11 tasks: one per test file (4-5 per file group), TDD-red + green per file | Full suite has 11 fewer failures | +| **Phase 3** | **Fix Gemini / Gemini CLI thinking-format (G15)** | 3 tasks: empirical investigation, fix the format mismatch (either normalization pass or parser extension), live_gui verification | Gemini thinking mono renders in Discussion Hub | +| **Phase 4** | **Add `` half-width marker (G16)** | 2 tasks: extend regex in `thinking_parser.py:9`, add 1+ new tests in `test_thinking_trace.py` | `parse_thinking_trace` extracts 1 segment from `...` text | +| **Phase 5** | **Housekeeping + regression sweep + docs (G17, G18, FR8)** | 4 tasks: fix `state.toml` duplicates, update `tracks.md`, full suite sweep, doc update | Full suite green; state.toml parseable; tracks.md row 24 updated | + +## 9. Risk Analysis + +| Risk | Likelihood | Impact | Mitigation | +|---|---|---|---| +| **R1**: The FR1 `_api_generate` fix accidentally introduces a regression in the existing FR2/FR3 logic | Low | High | The fix only ADDS lines, doesn't modify any existing logic. After the fix, the function matches the original (pre-`ai_loop_regressions_20260614`) semantics. | +| **R2**: The 11 test mock fixes have subtle differences in `result.ok` semantics that cause new test failures | Low | Low | The pattern is mechanical (`assert result.ok` then `assert result.data == "x"`). If a test is `assert result.ok` and `result.ok` is False, the failure message is clear (shows the ErrorInfo). | +| **R3**: The Gemini thinking format investigation (Phase 3) requires running a real Gemini request, which the user may not have credentials for | Medium | Medium | If real Gemini credentials are unavailable, use a mock client that returns a realistic Gemini response with thinking content. Document the format assumption. | +| **R4**: The `` regex extension accidentally matches too much (e.g., greedy matching across multiple segments) | Low | Low | Use `re.DOTALL` + non-greedy `.*?` (consistent with the existing pattern). The existing 5+ tests in `test_thinking_trace.py` will catch regressions. | +| **R5**: The `state.toml` cleanup (Phase 5) accidentally deletes the wrong lines | Very Low | High | Only delete the duplicate "pending" entries; the "completed" entries with commit SHAs must be preserved. The fix is mechanical and verifiable by re-running `tomllib.load()`. | + +## 10. Coordination with Pending Tracks + +This track is **independent** (no `blocked_by`) but interacts with: + +- **`ai_loop_regressions_20260614`** (shipped 2026-06-15): this track fixes the production regression (G1) and housekeeping issues (G17, G18) that the parent track left behind. It also picks up the 2 deferred bugs (G15, G16) from the parent's spec §13. No direct dependency — the parent track is shipped; this track is cleanup. +- **`public_api_migration_20260606`** (planned, not yet specced): this track's G2-G12 test mock fixes overlap with the public_api track's test migration scope. After this track ships, the public_api track will have 11 fewer tests to migrate. The public_api track is responsible for the remaining 50+ test call sites and the 5 production call sites. +- **`data_oriented_error_handling_20260606`** (shipped 2026-06-12): the root cause of the G2-G14 test mock bugs. This track is the test-cleanup follow-up to the parent refactor. No direct interaction — the parent track is shipped; this track fixes the remaining test fallout. +- **UI Polish Five Issues track** (`ui_polish_five_issues_20260302`): the 2 out-of-scope test failures (`test_discussion_truncate_layout`, `test_log_management_refresh`) are Phase 2 and Phase 3 of that track. That track has its own plan and is ready to start; this track does not touch it. + +## 11. Verification Criteria (definition of "done") + +The track is complete when ALL of the following are true: + +- [ ] `test_headless_service::TestHeadlessAPI::test_generate_endpoint` returns 200 (proves the G1 fix). +- [ ] All 11 test mock fixes (G2-G12) pass: full batched test suite has 11 fewer failures than before. +- [ ] `test_ai_client_tool_loop_builder::test_run_with_tool_loop_calls_request_builder_each_round` passes (G13). +- [ ] Phase 3 Gemini investigation produces a finding: either a normalization pass in `_send_gemini*` is added OR the parser is extended, AND a live_gui test or unit test demonstrates Gemini thinking-mono rendering. +- [ ] `parse_thinking_trace` correctly extracts 1 ThinkingSegment from `...` text (G16). +- [ ] `tests/test_thinking_trace.py` has 1+ new test for the half-width marker; all existing 5+ tests still pass. +- [ ] Python's `tomllib.load()` on `conductor/tracks/ai_loop_regressions_20260614/state.toml` succeeds (G17). +- [ ] `conductor/tracks.md` row 24 reflects the track's completion (G18). +- [ ] Full test suite is green (no new failures beyond the deferred test_rag_phase4_final_verify and UI Polish tests). +- [ ] `docs/guide_ai_client.md` "See Also" section has 2 new cross-references: (1) this cleanup track; (2) reference to `public_api_migration_20260606`. +- [ ] `metadata.json` `verification_criteria` field is updated to reflect completion. + +## 12. See Also — Follow-up Notes + +### 12.1 `public_api_migration_20260606` (planned, separate track) + +Migrates the remaining 5 production call sites and 63 test call sites to `send_result()`. This track fixes only the broken `_api_generate` site (G1) and the 11 test mock bugs that the public_api track would have touched (G2-G12). The remaining ~50 test call sites and 5 production call sites are deferred. + +### 12.2 `live_gui_mock_injection_20260615` (not yet specced) + +Infrastructure for mock injection into the live_gui subprocess. The `ai_loop_regressions_20260614` Tier 2 review (§9 of the report) recommended this as a follow-up because the live_gui smoke tests only verify the Hook API substrate is reachable — they don't exercise the full request → AI client → discussion pipeline end-to-end. Without this infrastructure, future tracks hitting live_gui + AI client will hit the same wall. + +### 12.3 `test_rag_phase4_final_verify` flakiness (separate RAG concern) + +Pre-existing RAG subsystem issue not caused by the data_oriented_error_handling or ai_loop_regressions tracks. The error `'NoneType' object has no attribute 'get'` is in RAG config lookup code, not AI client code. A partial fix was attempted in commit `16412ad5` (RAG Phase 4 dim-mismatch recovery). Recommended as a separate RAG track. + +### 12.4 UI Polish Five Issues track (separate track) + +The 2 unrelated test failures in the full suite (`test_discussion_truncate_layout` and `test_log_management_refresh`) are Phase 2 and Phase 3 of the UI Polish track (`ui_polish_five_issues_20260302`). That track has its own spec and plan. Not in scope here.