conductor(track): spec for ai_loop_regressions_20260614 (MiniMax/Gemini/Gemini CLI/DeepSeek)

2026-06-15 00:33:04 -04:00
parent 74e02485a1
commit 7a4dcc9690
1 changed files with 210 additions and 0 deletions
@@ -0,0 +1,210 @@
+# Track: AI Loop Regressions (MiniMax, Gemini, Gemini CLI, DeepSeek)
+
+**Status:** Active (spec approved 2026-06-14)
+**Initialized:** 2026-06-14
+**Owner:** Tier 2 Tech Lead
+**Priority:** High (4 providers broken in production; user-facing symptom)
+
+---
+
+## 1. Overview
+
+This track diagnoses and fixes 4 user-visible regressions in the AI loop that surfaced after the `data_oriented_error_handling_20260606` track shipped (2026-06-12) and the subsequent `ai client pass` commit `5030bd84` (2026-06-13, 503-line `src/ai_client.py` refactor in the Gemini region). The regressions affect **MiniMax (M2.x), Gemini, Gemini CLI, and DeepSeek** — the 4 providers most heavily touched by the refactor.
+
+The reported symptoms (per user 2026-06-14):
+1. **Thinking monologues no longer render** in the Discussion Hub.
+2. **AI turns do not get entries** in the Discussion Hub; the user must manually add them via the `History` button.
+
+The 2 symptoms are the visible result of **3 distinct bugs** interacting. Bug #2 is the primary culprit for the "no entries" symptom; Bug #3 is the primary culprit for the "no thinking" symptom on MiniMax; Bug #1 is dead code that breaks the error-reporting path. The user-supplied screenshots show entries in the Operations Hub `Comms History` and in the `Comms History` panel — confirming the requests reach the AI client and responses are emitted, but the response doesn't propagate to the discussion panel.
+
+## 2. Goals (Priority Order)
+
+| Priority | Goal | Rationale |
+|---|---|---|
+| **A (primary value)** | Fix Bug #2: `_handle_request_event` (the live AI send path) routes `send_result()` errors back into the Discussion Hub as error entries, restoring the pre-refactor UX. | The "no entries" symptom is the user-blocking bug. Fixing it makes the AI loop immediately usable again. |
+| **A (primary value)** | Fix Bug #3: MiniMax thinking content (`reasoning_details[0].text`) is wrapped in `<thinking>...</thinking>` tags in the returned text, so `thinking_parser.parse_thinking_trace` can extract it and the discussion entry shows the thinking segment. | MiniMax is the user's current provider; thinking monologues are a core feature. Without this fix the user cannot see the AI's reasoning. |
+| **B (architectural)** | Fix Bug #1: replace the 3 dead `except ai_client.ProviderError as e:` clauses in `src/app_controller.py` with the equivalent `send_result()` + `if not result.ok: ...` pattern. | The dead clauses silently swallow the `AttributeError` that arises when Python tries to evaluate `ai_client.ProviderError` to compare against the in-flight exception. The replacement aligns with the data-oriented error handling convention and gives Tier 2 a clean reference for the planned `public_api_migration_20260606` follow-up. |
+| **C (diagnostic)** | Root-cause verification: each of the 3 fixes is preceded by a failing TDD test that reproduces the bug, and a commit history audit is documented in the spec. | The user explicitly asked for an investigation track. The diagnostic tests are the empirical evidence for each root cause. |
+| **D (forward-looking)** | Document the deferred Gemini / Gemini CLI thinking-format investigation as a follow-up note in `docs/guide_ai_client.md` "See Also" section. | The user's complaint includes Gemini, but the format-compatibility issue is plausibly a pre-existing limitation, not a new regression. Documented as a follow-up to avoid scope creep. |
+
+### 2.1 Non-Goals (this track)
+
+- **Not** migrating the 5 remaining production call sites or 63 test call sites to `send_result()`. The planned `public_api_migration_20260606` follow-up track handles that. This track only migrates the 3 sites that are actively broken (the dead `except` clauses in `app_controller.py:305, 313, 3692`) — the minimum needed to make the live path work.
+- **Not** expanding the `thinking_parser.py` contract to support new marker formats. The `<thinking>`, `<thought>`, and `Thinking:` markers are the canonical set; the MiniMax fix uses the existing `<thinking>` format (matches DeepSeek's pattern).
+- **Not** investigating or fixing the Gemini / Gemini CLI thinking-format compatibility (deferred; see §13.1).
+- **Not** changing the `ProviderError` removal (it was correctly removed in commit `64b787b8`); we only fix the dead except clauses.
+- **Not** adding a new `thinking_parser` format; we work within the existing 3-marker contract.
+
+## 3. Current State Audit (as of commit `5030bd84`)
+
+### 3.1 Already Implemented (DO NOT re-implement)
+
+- **`src/result_types.py`**: `Result[T]`, `ErrorInfo`, `ErrorKind` dataclasses exist; `Result.data: T` + `Result.errors: list[ErrorInfo]` is the canonical pattern.
+- **`src/ai_client.py:send_result()`** (lines 2970-3092): the new public entry point, returns `Result[str]`. Routes to `_send_<vendor>_result()` per provider.
+- **`src/ai_client.py:send()`** (lines 2907-2968): the `@deprecated` shim, calls `send_result()` and returns `result.data`. **Never raises on error** — returns `""` instead.
+- **`src/ai_client.py:_send_*_result()`** (lines 1291-3082): all 9 vendors (`anthropic`, `gemini`, `gemini_cli`, `deepseek`, `minimax`, `qwen`, `grok`, `llama`, `llama_native`) return `Result[str]` with `ErrorInfo` on failure.
+- **`src/ai_client.py:run_with_tool_loop()`** (lines 734-836): already extracts reasoning via `reasoning_extractor` and stores it in `history[].reasoning_content`. The reasoning content is in the history but **NOT** in the returned text.
+- **`src/thinking_parser.py:parse_thinking_trace()`** (lines 8-54): already extracts `<thinking>`, `<thought>`, and `Thinking:` prefix segments.
+- **`src/app_controller.py:_on_comms_entry()`** (lines 3749-3840): already routes `response` comms entries to `_pending_history_adds` if `text_content.strip()` is truthy and `parse_thinking_trace` finds segments.
+- **DeepSeek's reasoning wrap pattern** (`src/ai_client.py:2117-2118`): DeepSeek wraps `reasoning_content` in `<thinking>...</thinking>` tags in the final text before returning. This is the reference pattern for the MiniMax fix.
+
+### 3.2 Gaps to Fill (This Track's Scope)
+
+| # | File:line | Gap | Symptom |
+|---|---|---|---|
+| **G1** | `src/app_controller.py:3677-3697` | `_handle_request_event` calls deprecated `ai_client.send()` and discards the result. On error, `result.data == ""` is queued as a `response` comms entry, but `_on_comms_entry` at line 3801 filters it out via `if text_content.strip():`. No discussion entry is added. | "AI turns are not getting proper entries" |
+| **G2** | `src/app_controller.py:305, 313, 3692` | Three `except ai_client.ProviderError as e:` clauses reference a class that was removed in commit `64b787b8`. Python evaluates the class on every raised exception; on missing class, the except clause itself raises `AttributeError`. The error path is broken. | Silently dropped error messages (compounding G1) |
+| **G3** | `src/ai_client.py:797-836, 2418-2443` | `_send_minimax()` uses `reasoning_extractor` to extract reasoning into `history[].reasoning_content`, but the returned `response_text` (and thus `Result.data`) does not include the thinking tags. `parse_thinking_trace` finds no `<thinking>` blocks, so no thinking segments are added to the discussion entry. | "Thinking monologues no longer rendering" (MiniMax) |
+| **G4** | (deferred) `src/ai_client.py:_send_gemini`, `_send_gemini_cli` | Gemini SDK output may include thinking in a format that `parse_thinking_trace` doesn't match. Empirical verification needed. | "Thinking monologues no longer rendering" (Gemini) |
+
+## 4. Functional Requirements
+
+### FR1: Error response becomes a discussion entry (Bug #2 / G1)
+
+`_handle_request_event` in `src/app_controller.py:3677-3697` must:
+
+1. Call `ai_client.send_result()` instead of `ai_client.send()`.
+2. On `result.ok == False`: queue a `response` comms entry with `text=ui_error_message()`, `status="error"`, `role="Vendor API"` so the user sees the error in both the AI response panel AND as a discussion entry.
+3. On `result.ok == True`: queue a `response` comms entry with `text=result.data`, `status="done"`, `role="AI"` (preserves current behavior).
+4. Update `_ai_status` to `f"error: {ui_error_message()}"` on failure (preserves the visible status indicator).
+5. Preserve the existing streaming path (`_on_ai_stream` continues to receive chunks during `stream=True` execution).
+
+### FR2: Replace dead `except ai_client.ProviderError` clauses (Bug #1 / G2)
+
+All 3 sites in `src/app_controller.py` (`305, 313, 3692`) must:
+
+1. Remove the `except ai_client.ProviderError` clause.
+2. Replace with either:
+   - **For sites that call `ai_client.send()`**: call `ai_client.send_result()` instead; if `not result.ok`, route the error to the API response (HTTPException for the API sites, comms queue for the live site).
+   - **For sites that call other `ai_client` methods that raise**: use a generic `except Exception` and convert to a structured response (HTTPException for API sites, error entry for the live site).
+3. Reference the data-oriented error handling styleguide (`conductor/code_styleguides/error_handling.md` §3.1) in the resulting code's docstring (so future migrations follow the same pattern).
+
+### FR3: MiniMax thinking content reaches `parse_thinking_trace` (Bug #3 / G3)
+
+`_send_minimax` in `src/ai_client.py:2418-2443` (or `run_with_tool_loop` at lines 797-836) must:
+
+1. When `caps.reasoning` is True AND the previous round extracted non-empty `reasoning_content`, the NEXT round's `response_text` (and `Result.data`) must include the reasoning wrapped in `<thinking>...</thinking>` tags (matching DeepSeek's pattern at `src/ai_client.py:2117-2118`).
+2. The `run_with_tool_loop` history write at line 808 must continue to store the raw `reasoning_content` (so subsequent API calls can use it for the next turn's reasoning). The thinking tag wrapping is additive: the raw reasoning is in the history, the tagged reasoning is in the visible text.
+3. The `<think>...</think>` format used by some MiniMax models (visible in the user-supplied screenshot 1) must continue to work — `parse_thinking_trace` already supports it (the regex at `src/thinking_parser.py:22` matches `<thinking>` and `<thought>`; the screenshot shows the `<think>` format which is **not** currently supported — this is a separate bug and is deferred to the follow-up).
+
+   **Important scope clarification**: The user's screenshot shows `<think>This is DWARF debug info...</think>` style — using the half-width `<think>` (no closing match for the regex). The MiniMax fix in this track wraps the reasoning in `<thinking>` (the supported form), not `<think>`. This is a temporary scope reduction: the fix restores thinking-mono rendering for the common case (DeepSeek-style `<thinking>` tags), and the half-width `<think>` format is a known gap that's documented as a follow-up.
+
+### FR4: No new files in `src/`
+
+Per the project's hard rule (see `AGENTS.md` "File Size and Naming Convention"), no new `src/<thing>.py` files. All fixes go in:
+- `src/app_controller.py` (FR1, FR2)
+- `src/ai_client.py` (FR3)
+
+### FR5: Tests cover all 3 fixes
+
+- `tests/test_ai_loop_regressions_20260614.py` (new file): TDD tests for FR1, FR2, FR3.
+  - **FR1 tests** (3+ tests): (a) successful response becomes a discussion entry; (b) error response becomes a discussion entry with `status="error"`; (c) `_ai_status` is updated correctly on both paths.
+  - **FR2 tests** (2+ tests): (a) the dead `except ProviderError` clause is removed (assert no longer present via AST scan); (b) the replaced code path correctly raises HTTPException for the API sites.
+  - **FR3 tests** (2+ tests): (a) `_send_minimax` returns `Result.data` that contains `<thinking>` tags when reasoning is extracted; (b) the discussion entry's `thinking_segments` field is populated when `parse_thinking_trace` is run on the result.
+
+## 5. Non-Functional Requirements
+
+- **NFR1 (Atomic per-task commits)**: each plan task is one commit; no batching.
+- **NFR2 (1-space indentation)**: enforced by the project's AI-Optimized Python style.
+- **NFR3 (No diagnostic noise in production)**: no `sys.stderr.write("[XYZ_DIAG] ...")` lines in the committed code. If instrumentation is needed for the TDD test, it goes to `tests/artifacts/<test_name>.diag.log` (not in the test file itself).
+- **NFR4 (Backward compatibility)**: the deprecated `ai_client.send()` shim remains working (the `public_api_migration_20260606` track is responsible for removal; this track only fixes the 3 broken except clauses).
+- **NFR5 (No regression in other providers)**: the 5 unaffected providers (Anthropic, Qwen, Grok, Llama, Llama native) must continue to pass their existing tests.
+- **NFR6 (Thread safety)**: all fixes preserve the existing `_send_lock` and per-provider history locks; the fix for FR1 must not introduce a new race between the streaming `_on_ai_stream` callback and the final `result.data` write.
+
+## 6. Architecture Reference
+
+For implementation details, consult:
+
+- **`docs/guide_ai_client.md`**: the canonical guide for `src/ai_client.py`; the new `send_result()` API is documented in the "Data-Oriented Error Handling (Fleury Pattern) > Public API" section. FR1 and FR3 should follow the patterns shown there.
+- **`docs/guide_app_controller.md`**: the canonical guide for `src/app_controller.py`; the `_handle_request_event` and `_on_comms_entry` flows are described in §"AI Loop Lifecycle". FR1 and FR2 changes are in this subsystem.
+- **`docs/guide_thinking.md`** (if it exists; otherwise `docs/guide_discussions.md`): the canonical guide for thinking-mono rendering; the `parse_thinking_trace` markers are documented in §"Thinking Markers".
+- **`conductor/code_styleguides/error_handling.md`**: the canonical reference for the Result/ErrorInfo pattern; the new FR2 code paths should follow §3.1 "AND over OR (Result struct with side-channel errors)".
+- **`docs/reports/data_oriented_error_handling_phase3_20260612.md`** (if it exists; otherwise the metadata.json `deprecation_strategy` section of the parent track): documents the `send_result()` deprecation strategy and the planned `public_api_migration_20260606` follow-up.
+
+## 7. Out of Scope
+
+- **Gemini / Gemini CLI thinking-format compatibility investigation** (Bug #4 / G4). The user's complaint includes Gemini, but the format may be a pre-existing limitation. Documented as a follow-up in §13.1.
+- **Migrating the remaining 5 production call sites + 63 test call sites to `send_result()`**. The planned `public_api_migration_20260606` track handles this.
+- **Expanding `thinking_parser.py` to support new marker formats** (e.g., `<think>` without closing `</think>`).
+- **Restructuring `_handle_request_event` to be testable in isolation** (a follow-up; this track's tests use mocks for the AI client, not the controller).
+- **Any changes to the `multi_agent_conductor.py` MMA worker interface** (it still uses `send()`; will migrate in the public_api track).
+- **Restoring the `<think>` (half-width) marker support**. The user's screenshot shows this format; the current `parse_thinking_trace` regex requires `<thinking>` (full-width). This is a separate gap documented in §13.2.
+
+## 8. Phases (Summary)
+
+| Phase | Name | Tasks | Verification |
+|---|---|---|---|
+| **Phase 1** | **Root-cause verification** (TDD red) | 3 tasks: write 3+ failing tests for FR1, FR2, FR3; commit each as a separate test | `pytest tests/test_ai_loop_regressions_20260614.py` shows red |
+| **Phase 2** | **Fix FR1 (Bug #2): error response becomes a discussion entry** | 3 tasks: implement the fix in `_handle_request_event`; run the FR1 tests; commit | `pytest tests/test_ai_loop_regressions_20260614.py::test_*fr1*` shows green |
+| **Phase 3** | **Fix FR2 (Bug #1): replace dead `except ProviderError` clauses** | 3 tasks: replace 3 sites; run the FR2 tests; commit | `pytest tests/test_ai_loop_regressions_20260614.py::test_*fr2*` shows green; AST scan shows no `ProviderError` references |
+| **Phase 4** | **Fix FR3 (Bug #3): MiniMax thinking mono rendering** | 3 tasks: wrap reasoning in `<thinking>` tags in `_send_minimax` (or in `run_with_tool_loop`); run the FR3 tests; commit | `pytest tests/test_ai_loop_regressions_20260614.py::test_*fr3*` shows green |
+| **Phase 5** | **Regression sweep + docs** | 3 tasks: run full `pytest tests/`; add follow-up note to `docs/guide_ai_client.md` "See Also" section; commit | Full suite green; doc note present |
+
+## 9. Risk Analysis
+
+| Risk | Likelihood | Impact | Mitigation |
+|---|---|---|---|
+| **R1**: The Phase 4 fix (MiniMax thinking wrap) breaks the existing DeepSeek tests because both use `run_with_tool_loop`. | Medium | High | Apply the wrap only when `reasoning_extractor` is set AND returns non-empty; preserve the DeepSeek-specific path (which already wraps). The fix is conditional on `caps.reasoning`, not universal. |
+| **R2**: The FR1 fix changes the streaming behavior — the streaming chunks go through `_on_ai_stream` (via `stream_callback`), and the final `result.data` is set after streaming completes. The fix must not break the existing streaming contract. | Medium | High | The FR1 fix only changes the FINAL response comms entry (after `send_result()` returns). The streaming path is unchanged. Phase 2's test must include a streaming test to lock this. |
+| **R3**: The 3 sites in `app_controller.py` that have `except ProviderError` may have other callers depending on the exception behavior. | Low | Medium | All 3 sites are in `_handle_request_event` (1 site) and 2 API hook endpoints (`/api/v1/generate`, `/api/v1/generate_sync`). The fix routes errors the same way the original code intended, just via `Result.ok` instead of `ProviderError`. |
+| **R4**: The `parse_thinking_trace` regex is greedy; wrapping thinking in `<thinking>` tags and then parsing it may produce nested segments. | Low | Low | The regex at `src/thinking_parser.py:9` is `re.DOTALL \| re.IGNORECASE` and uses `.*?` (non-greedy). Nested `<thinking>` blocks would not match because the outer block consumes the inner; this is the same behavior DeepSeek has, and the existing tests pass for DeepSeek. |
+| **R5**: The user is wrong about Gemini / Gemini CLI — those may not actually be broken. | Medium | Low | The deferred Phase-5-style follow-up will investigate empirically. The user's primary report was MiniMax; the other 3 are mentioned as "all regressed" but the fix for Bug #1 (dead except clauses) and Bug #2 (empty data) restores them all to working order. The thinking-mono issue is MiniMax-specific. |
+
+## 10. Coordination with Pending Tracks
+
+This track is **independent** (no `blocked_by` and no `blocks` in `metadata.json`). It does not depend on or block any active track.
+
+However, it interacts with:
+- **`public_api_migration_20260606`** (planned, not yet specced): this track's FR1 fix to `_handle_request_event` is a partial migration. The full migration (5 production + 63 test sites) is out of scope here; the follow-up track picks up where this leaves off. The two tracks share the same destination but this track fixes the user-blocking regressions first.
+- **`data_oriented_error_handling_20260606`** (shipped 2026-06-12): this track is the user-facing bug-fix for the issues introduced by the parent track. It does not modify any of the 3 files the parent track touched (`mcp_client.py`, `ai_client.py`, `rag_engine.py`); it only modifies `app_controller.py` (the 1 file the parent track did NOT touch). The MiniMax fix touches `ai_client.py` for FR3 (1 file the parent touched).
+- **`qwen_llama_grok_followup_20260611`** (archived 2026-06-11): no direct interaction, but the MiniMax fix in FR3 follows the same reasoning-extraction pattern that track introduced for the OpenAI-compatible providers.
+
+## 11. Verification Criteria (definition of "done")
+
+The track is complete when ALL of the following are true:
+
+- [ ] All 3 phase 1-4 tests pass (`pytest tests/test_ai_loop_regressions_20260614.py` shows green).
+- [ ] Full test suite passes (`uv run pytest tests/` shows green; no new failures).
+- [ ] `grep -rn "ProviderError" src/` returns no matches.
+- [ ] `grep -rn "ai_client\.ProviderError" src/` returns no matches.
+- [ ] Live GUI test: a MiniMax `M2.7` request with reasoning returns a discussion entry that includes a `thinking_segments` field (use the `live_gui` fixture + `ApiHookClient.get_value("disc_entries")`).
+- [ ] Live GUI test: a MiniMax request that fails (e.g., invalid API key) returns a discussion entry with `status="error"` and the error message in the `content` field.
+- [ ] Live GUI test: a Gemini request that succeeds returns a discussion entry (verifies the FR1 fix doesn't break Gemini).
+- [ ] `docs/guide_ai_client.md` "See Also" section includes the 2 follow-up notes (§13.1 Gemini thinking investigation, §13.2 `<think>` half-width marker support).
+- [ ] `metadata.json` `verification_criteria` field is updated to reflect completion.
+
+## 12. Acceptance Test (the user can verify this themselves)
+
+After this track ships, the user should be able to:
+
+1. Open Manual Slop with MiniMax as the active provider.
+2. Send a message that requires the AI to reason (e.g., "explain the structure of this function").
+3. Verify: the AI's response appears in the Discussion Hub **without** manually pressing the `History` button.
+4. Verify: the response has a `Monologue` collapsible section showing the AI's thinking.
+5. Trigger a failure (e.g., switch to an invalid MiniMax API key, then send a message).
+6. Verify: an error entry appears in the Discussion Hub with the error message.
+
+Before this track ships, steps 3 and 4 fail (for MiniMax); step 6 fails (for all 4 affected providers).
+
+## 13. See Also — Follow-up Notes
+
+### 13.1 Gemini / Gemini CLI thinking-format compatibility (deferred)
+
+The user's complaint includes Gemini and Gemini CLI. The likely cause is a format mismatch between what the Gemini SDK outputs and what `parse_thinking_trace` recognizes:
+
+- `parse_thinking_trace` (`src/thinking_parser.py:9`) matches `<thinking>`, `<thought>`, and `Thinking:` prefix.
+- The Gemini SDK's `resp.text` may include thinking as plain prose or as `*thinking aloud*` markdown, depending on the SDK version and the model's prompt formatting.
+
+This track fixes Bugs #1, #2, #3. The Gemini / Gemini CLI thinking-format issue is plausibly a pre-existing limitation (the existing tests for `parse_thinking_trace` show it doesn't match all Gemini output formats) rather than a new regression from the recent refactor.
+
+**Follow-up track** (to be specced): investigate empirically by running a Gemini request that produces reasoning and inspecting the raw `resp.text`; add a normalization pass in `_send_gemini*` if needed.
+
+### 13.2 `<think>` (half-width) marker support (deferred)
+
+The user's screenshot 1 shows a discussion entry containing `<think>This is DWARF debug info, not the actual disassembly...</think>` — the half-width `<think>` form (no closing `</think>` in the regex). The current `parse_thinking_trace` regex (`src/thinking_parser.py:9`) requires the full `<thinking>` form. Some models (notably certain DeepSeek-R1 outputs and possibly the MiniMax M2.7 output) use the half-width `<think>` form.
+
+**Follow-up track** (to be specced): extend `parse_thinking_trace` to support the half-width `<think>...</think>` form (the closing tag is the same). The change is small (~3 lines in `src/thinking_parser.py:9`); the test file is `tests/test_thinking_trace.py` (5+ existing tests for the full-width form).
+
+### 13.3 Public API Result Migration (planned, separate)
+
+The `public_api_migration_20260606` follow-up (planned, not yet specced) will migrate the 5 remaining production call sites and 63 test call sites to `send_result()`. This track fixes the 3 sites in `app_controller.py` that are actively broken; the public_api track picks up from there.