From 7a4dcc96904757c6a002a3dfbf502f4aa5fd49bf Mon Sep 17 00:00:00 2001 From: Ed_ Date: Mon, 15 Jun 2026 00:33:04 -0400 Subject: [PATCH] conductor(track): spec for ai_loop_regressions_20260614 (MiniMax/Gemini/Gemini CLI/DeepSeek) --- .../ai_loop_regressions_20260614/spec.md | 210 ++++++++++++++++++ 1 file changed, 210 insertions(+) create mode 100644 conductor/tracks/ai_loop_regressions_20260614/spec.md diff --git a/conductor/tracks/ai_loop_regressions_20260614/spec.md b/conductor/tracks/ai_loop_regressions_20260614/spec.md new file mode 100644 index 00000000..c7468072 --- /dev/null +++ b/conductor/tracks/ai_loop_regressions_20260614/spec.md @@ -0,0 +1,210 @@ +# Track: AI Loop Regressions (MiniMax, Gemini, Gemini CLI, DeepSeek) + +**Status:** Active (spec approved 2026-06-14) +**Initialized:** 2026-06-14 +**Owner:** Tier 2 Tech Lead +**Priority:** High (4 providers broken in production; user-facing symptom) + +--- + +## 1. Overview + +This track diagnoses and fixes 4 user-visible regressions in the AI loop that surfaced after the `data_oriented_error_handling_20260606` track shipped (2026-06-12) and the subsequent `ai client pass` commit `5030bd84` (2026-06-13, 503-line `src/ai_client.py` refactor in the Gemini region). The regressions affect **MiniMax (M2.x), Gemini, Gemini CLI, and DeepSeek** — the 4 providers most heavily touched by the refactor. + +The reported symptoms (per user 2026-06-14): +1. **Thinking monologues no longer render** in the Discussion Hub. +2. **AI turns do not get entries** in the Discussion Hub; the user must manually add them via the `History` button. + +The 2 symptoms are the visible result of **3 distinct bugs** interacting. Bug #2 is the primary culprit for the "no entries" symptom; Bug #3 is the primary culprit for the "no thinking" symptom on MiniMax; Bug #1 is dead code that breaks the error-reporting path. The user-supplied screenshots show entries in the Operations Hub `Comms History` and in the `Comms History` panel — confirming the requests reach the AI client and responses are emitted, but the response doesn't propagate to the discussion panel. + +## 2. Goals (Priority Order) + +| Priority | Goal | Rationale | +|---|---|---| +| **A (primary value)** | Fix Bug #2: `_handle_request_event` (the live AI send path) routes `send_result()` errors back into the Discussion Hub as error entries, restoring the pre-refactor UX. | The "no entries" symptom is the user-blocking bug. Fixing it makes the AI loop immediately usable again. | +| **A (primary value)** | Fix Bug #3: MiniMax thinking content (`reasoning_details[0].text`) is wrapped in `...` tags in the returned text, so `thinking_parser.parse_thinking_trace` can extract it and the discussion entry shows the thinking segment. | MiniMax is the user's current provider; thinking monologues are a core feature. Without this fix the user cannot see the AI's reasoning. | +| **B (architectural)** | Fix Bug #1: replace the 3 dead `except ai_client.ProviderError as e:` clauses in `src/app_controller.py` with the equivalent `send_result()` + `if not result.ok: ...` pattern. | The dead clauses silently swallow the `AttributeError` that arises when Python tries to evaluate `ai_client.ProviderError` to compare against the in-flight exception. The replacement aligns with the data-oriented error handling convention and gives Tier 2 a clean reference for the planned `public_api_migration_20260606` follow-up. | +| **C (diagnostic)** | Root-cause verification: each of the 3 fixes is preceded by a failing TDD test that reproduces the bug, and a commit history audit is documented in the spec. | The user explicitly asked for an investigation track. The diagnostic tests are the empirical evidence for each root cause. | +| **D (forward-looking)** | Document the deferred Gemini / Gemini CLI thinking-format investigation as a follow-up note in `docs/guide_ai_client.md` "See Also" section. | The user's complaint includes Gemini, but the format-compatibility issue is plausibly a pre-existing limitation, not a new regression. Documented as a follow-up to avoid scope creep. | + +### 2.1 Non-Goals (this track) + +- **Not** migrating the 5 remaining production call sites or 63 test call sites to `send_result()`. The planned `public_api_migration_20260606` follow-up track handles that. This track only migrates the 3 sites that are actively broken (the dead `except` clauses in `app_controller.py:305, 313, 3692`) — the minimum needed to make the live path work. +- **Not** expanding the `thinking_parser.py` contract to support new marker formats. The ``, ``, and `Thinking:` markers are the canonical set; the MiniMax fix uses the existing `` format (matches DeepSeek's pattern). +- **Not** investigating or fixing the Gemini / Gemini CLI thinking-format compatibility (deferred; see §13.1). +- **Not** changing the `ProviderError` removal (it was correctly removed in commit `64b787b8`); we only fix the dead except clauses. +- **Not** adding a new `thinking_parser` format; we work within the existing 3-marker contract. + +## 3. Current State Audit (as of commit `5030bd84`) + +### 3.1 Already Implemented (DO NOT re-implement) + +- **`src/result_types.py`**: `Result[T]`, `ErrorInfo`, `ErrorKind` dataclasses exist; `Result.data: T` + `Result.errors: list[ErrorInfo]` is the canonical pattern. +- **`src/ai_client.py:send_result()`** (lines 2970-3092): the new public entry point, returns `Result[str]`. Routes to `_send__result()` per provider. +- **`src/ai_client.py:send()`** (lines 2907-2968): the `@deprecated` shim, calls `send_result()` and returns `result.data`. **Never raises on error** — returns `""` instead. +- **`src/ai_client.py:_send_*_result()`** (lines 1291-3082): all 9 vendors (`anthropic`, `gemini`, `gemini_cli`, `deepseek`, `minimax`, `qwen`, `grok`, `llama`, `llama_native`) return `Result[str]` with `ErrorInfo` on failure. +- **`src/ai_client.py:run_with_tool_loop()`** (lines 734-836): already extracts reasoning via `reasoning_extractor` and stores it in `history[].reasoning_content`. The reasoning content is in the history but **NOT** in the returned text. +- **`src/thinking_parser.py:parse_thinking_trace()`** (lines 8-54): already extracts ``, ``, and `Thinking:` prefix segments. +- **`src/app_controller.py:_on_comms_entry()`** (lines 3749-3840): already routes `response` comms entries to `_pending_history_adds` if `text_content.strip()` is truthy and `parse_thinking_trace` finds segments. +- **DeepSeek's reasoning wrap pattern** (`src/ai_client.py:2117-2118`): DeepSeek wraps `reasoning_content` in `...` tags in the final text before returning. This is the reference pattern for the MiniMax fix. + +### 3.2 Gaps to Fill (This Track's Scope) + +| # | File:line | Gap | Symptom | +|---|---|---|---| +| **G1** | `src/app_controller.py:3677-3697` | `_handle_request_event` calls deprecated `ai_client.send()` and discards the result. On error, `result.data == ""` is queued as a `response` comms entry, but `_on_comms_entry` at line 3801 filters it out via `if text_content.strip():`. No discussion entry is added. | "AI turns are not getting proper entries" | +| **G2** | `src/app_controller.py:305, 313, 3692` | Three `except ai_client.ProviderError as e:` clauses reference a class that was removed in commit `64b787b8`. Python evaluates the class on every raised exception; on missing class, the except clause itself raises `AttributeError`. The error path is broken. | Silently dropped error messages (compounding G1) | +| **G3** | `src/ai_client.py:797-836, 2418-2443` | `_send_minimax()` uses `reasoning_extractor` to extract reasoning into `history[].reasoning_content`, but the returned `response_text` (and thus `Result.data`) does not include the thinking tags. `parse_thinking_trace` finds no `` blocks, so no thinking segments are added to the discussion entry. | "Thinking monologues no longer rendering" (MiniMax) | +| **G4** | (deferred) `src/ai_client.py:_send_gemini`, `_send_gemini_cli` | Gemini SDK output may include thinking in a format that `parse_thinking_trace` doesn't match. Empirical verification needed. | "Thinking monologues no longer rendering" (Gemini) | + +## 4. Functional Requirements + +### FR1: Error response becomes a discussion entry (Bug #2 / G1) + +`_handle_request_event` in `src/app_controller.py:3677-3697` must: + +1. Call `ai_client.send_result()` instead of `ai_client.send()`. +2. On `result.ok == False`: queue a `response` comms entry with `text=ui_error_message()`, `status="error"`, `role="Vendor API"` so the user sees the error in both the AI response panel AND as a discussion entry. +3. On `result.ok == True`: queue a `response` comms entry with `text=result.data`, `status="done"`, `role="AI"` (preserves current behavior). +4. Update `_ai_status` to `f"error: {ui_error_message()}"` on failure (preserves the visible status indicator). +5. Preserve the existing streaming path (`_on_ai_stream` continues to receive chunks during `stream=True` execution). + +### FR2: Replace dead `except ai_client.ProviderError` clauses (Bug #1 / G2) + +All 3 sites in `src/app_controller.py` (`305, 313, 3692`) must: + +1. Remove the `except ai_client.ProviderError` clause. +2. Replace with either: + - **For sites that call `ai_client.send()`**: call `ai_client.send_result()` instead; if `not result.ok`, route the error to the API response (HTTPException for the API sites, comms queue for the live site). + - **For sites that call other `ai_client` methods that raise**: use a generic `except Exception` and convert to a structured response (HTTPException for API sites, error entry for the live site). +3. Reference the data-oriented error handling styleguide (`conductor/code_styleguides/error_handling.md` §3.1) in the resulting code's docstring (so future migrations follow the same pattern). + +### FR3: MiniMax thinking content reaches `parse_thinking_trace` (Bug #3 / G3) + +`_send_minimax` in `src/ai_client.py:2418-2443` (or `run_with_tool_loop` at lines 797-836) must: + +1. When `caps.reasoning` is True AND the previous round extracted non-empty `reasoning_content`, the NEXT round's `response_text` (and `Result.data`) must include the reasoning wrapped in `...` tags (matching DeepSeek's pattern at `src/ai_client.py:2117-2118`). +2. The `run_with_tool_loop` history write at line 808 must continue to store the raw `reasoning_content` (so subsequent API calls can use it for the next turn's reasoning). The thinking tag wrapping is additive: the raw reasoning is in the history, the tagged reasoning is in the visible text. +3. The `...` format used by some MiniMax models (visible in the user-supplied screenshot 1) must continue to work — `parse_thinking_trace` already supports it (the regex at `src/thinking_parser.py:22` matches `` and ``; the screenshot shows the `` format which is **not** currently supported — this is a separate bug and is deferred to the follow-up). + + **Important scope clarification**: The user's screenshot shows `This is DWARF debug info...` style — using the half-width `` (no closing match for the regex). The MiniMax fix in this track wraps the reasoning in `` (the supported form), not ``. This is a temporary scope reduction: the fix restores thinking-mono rendering for the common case (DeepSeek-style `` tags), and the half-width `` format is a known gap that's documented as a follow-up. + +### FR4: No new files in `src/` + +Per the project's hard rule (see `AGENTS.md` "File Size and Naming Convention"), no new `src/.py` files. All fixes go in: +- `src/app_controller.py` (FR1, FR2) +- `src/ai_client.py` (FR3) + +### FR5: Tests cover all 3 fixes + +- `tests/test_ai_loop_regressions_20260614.py` (new file): TDD tests for FR1, FR2, FR3. + - **FR1 tests** (3+ tests): (a) successful response becomes a discussion entry; (b) error response becomes a discussion entry with `status="error"`; (c) `_ai_status` is updated correctly on both paths. + - **FR2 tests** (2+ tests): (a) the dead `except ProviderError` clause is removed (assert no longer present via AST scan); (b) the replaced code path correctly raises HTTPException for the API sites. + - **FR3 tests** (2+ tests): (a) `_send_minimax` returns `Result.data` that contains `` tags when reasoning is extracted; (b) the discussion entry's `thinking_segments` field is populated when `parse_thinking_trace` is run on the result. + +## 5. Non-Functional Requirements + +- **NFR1 (Atomic per-task commits)**: each plan task is one commit; no batching. +- **NFR2 (1-space indentation)**: enforced by the project's AI-Optimized Python style. +- **NFR3 (No diagnostic noise in production)**: no `sys.stderr.write("[XYZ_DIAG] ...")` lines in the committed code. If instrumentation is needed for the TDD test, it goes to `tests/artifacts/.diag.log` (not in the test file itself). +- **NFR4 (Backward compatibility)**: the deprecated `ai_client.send()` shim remains working (the `public_api_migration_20260606` track is responsible for removal; this track only fixes the 3 broken except clauses). +- **NFR5 (No regression in other providers)**: the 5 unaffected providers (Anthropic, Qwen, Grok, Llama, Llama native) must continue to pass their existing tests. +- **NFR6 (Thread safety)**: all fixes preserve the existing `_send_lock` and per-provider history locks; the fix for FR1 must not introduce a new race between the streaming `_on_ai_stream` callback and the final `result.data` write. + +## 6. Architecture Reference + +For implementation details, consult: + +- **`docs/guide_ai_client.md`**: the canonical guide for `src/ai_client.py`; the new `send_result()` API is documented in the "Data-Oriented Error Handling (Fleury Pattern) > Public API" section. FR1 and FR3 should follow the patterns shown there. +- **`docs/guide_app_controller.md`**: the canonical guide for `src/app_controller.py`; the `_handle_request_event` and `_on_comms_entry` flows are described in §"AI Loop Lifecycle". FR1 and FR2 changes are in this subsystem. +- **`docs/guide_thinking.md`** (if it exists; otherwise `docs/guide_discussions.md`): the canonical guide for thinking-mono rendering; the `parse_thinking_trace` markers are documented in §"Thinking Markers". +- **`conductor/code_styleguides/error_handling.md`**: the canonical reference for the Result/ErrorInfo pattern; the new FR2 code paths should follow §3.1 "AND over OR (Result struct with side-channel errors)". +- **`docs/reports/data_oriented_error_handling_phase3_20260612.md`** (if it exists; otherwise the metadata.json `deprecation_strategy` section of the parent track): documents the `send_result()` deprecation strategy and the planned `public_api_migration_20260606` follow-up. + +## 7. Out of Scope + +- **Gemini / Gemini CLI thinking-format compatibility investigation** (Bug #4 / G4). The user's complaint includes Gemini, but the format may be a pre-existing limitation. Documented as a follow-up in §13.1. +- **Migrating the remaining 5 production call sites + 63 test call sites to `send_result()`**. The planned `public_api_migration_20260606` track handles this. +- **Expanding `thinking_parser.py` to support new marker formats** (e.g., `` without closing ``). +- **Restructuring `_handle_request_event` to be testable in isolation** (a follow-up; this track's tests use mocks for the AI client, not the controller). +- **Any changes to the `multi_agent_conductor.py` MMA worker interface** (it still uses `send()`; will migrate in the public_api track). +- **Restoring the `` (half-width) marker support**. The user's screenshot shows this format; the current `parse_thinking_trace` regex requires `` (full-width). This is a separate gap documented in §13.2. + +## 8. Phases (Summary) + +| Phase | Name | Tasks | Verification | +|---|---|---|---| +| **Phase 1** | **Root-cause verification** (TDD red) | 3 tasks: write 3+ failing tests for FR1, FR2, FR3; commit each as a separate test | `pytest tests/test_ai_loop_regressions_20260614.py` shows red | +| **Phase 2** | **Fix FR1 (Bug #2): error response becomes a discussion entry** | 3 tasks: implement the fix in `_handle_request_event`; run the FR1 tests; commit | `pytest tests/test_ai_loop_regressions_20260614.py::test_*fr1*` shows green | +| **Phase 3** | **Fix FR2 (Bug #1): replace dead `except ProviderError` clauses** | 3 tasks: replace 3 sites; run the FR2 tests; commit | `pytest tests/test_ai_loop_regressions_20260614.py::test_*fr2*` shows green; AST scan shows no `ProviderError` references | +| **Phase 4** | **Fix FR3 (Bug #3): MiniMax thinking mono rendering** | 3 tasks: wrap reasoning in `` tags in `_send_minimax` (or in `run_with_tool_loop`); run the FR3 tests; commit | `pytest tests/test_ai_loop_regressions_20260614.py::test_*fr3*` shows green | +| **Phase 5** | **Regression sweep + docs** | 3 tasks: run full `pytest tests/`; add follow-up note to `docs/guide_ai_client.md` "See Also" section; commit | Full suite green; doc note present | + +## 9. Risk Analysis + +| Risk | Likelihood | Impact | Mitigation | +|---|---|---|---| +| **R1**: The Phase 4 fix (MiniMax thinking wrap) breaks the existing DeepSeek tests because both use `run_with_tool_loop`. | Medium | High | Apply the wrap only when `reasoning_extractor` is set AND returns non-empty; preserve the DeepSeek-specific path (which already wraps). The fix is conditional on `caps.reasoning`, not universal. | +| **R2**: The FR1 fix changes the streaming behavior — the streaming chunks go through `_on_ai_stream` (via `stream_callback`), and the final `result.data` is set after streaming completes. The fix must not break the existing streaming contract. | Medium | High | The FR1 fix only changes the FINAL response comms entry (after `send_result()` returns). The streaming path is unchanged. Phase 2's test must include a streaming test to lock this. | +| **R3**: The 3 sites in `app_controller.py` that have `except ProviderError` may have other callers depending on the exception behavior. | Low | Medium | All 3 sites are in `_handle_request_event` (1 site) and 2 API hook endpoints (`/api/v1/generate`, `/api/v1/generate_sync`). The fix routes errors the same way the original code intended, just via `Result.ok` instead of `ProviderError`. | +| **R4**: The `parse_thinking_trace` regex is greedy; wrapping thinking in `` tags and then parsing it may produce nested segments. | Low | Low | The regex at `src/thinking_parser.py:9` is `re.DOTALL \| re.IGNORECASE` and uses `.*?` (non-greedy). Nested `` blocks would not match because the outer block consumes the inner; this is the same behavior DeepSeek has, and the existing tests pass for DeepSeek. | +| **R5**: The user is wrong about Gemini / Gemini CLI — those may not actually be broken. | Medium | Low | The deferred Phase-5-style follow-up will investigate empirically. The user's primary report was MiniMax; the other 3 are mentioned as "all regressed" but the fix for Bug #1 (dead except clauses) and Bug #2 (empty data) restores them all to working order. The thinking-mono issue is MiniMax-specific. | + +## 10. Coordination with Pending Tracks + +This track is **independent** (no `blocked_by` and no `blocks` in `metadata.json`). It does not depend on or block any active track. + +However, it interacts with: +- **`public_api_migration_20260606`** (planned, not yet specced): this track's FR1 fix to `_handle_request_event` is a partial migration. The full migration (5 production + 63 test sites) is out of scope here; the follow-up track picks up where this leaves off. The two tracks share the same destination but this track fixes the user-blocking regressions first. +- **`data_oriented_error_handling_20260606`** (shipped 2026-06-12): this track is the user-facing bug-fix for the issues introduced by the parent track. It does not modify any of the 3 files the parent track touched (`mcp_client.py`, `ai_client.py`, `rag_engine.py`); it only modifies `app_controller.py` (the 1 file the parent track did NOT touch). The MiniMax fix touches `ai_client.py` for FR3 (1 file the parent touched). +- **`qwen_llama_grok_followup_20260611`** (archived 2026-06-11): no direct interaction, but the MiniMax fix in FR3 follows the same reasoning-extraction pattern that track introduced for the OpenAI-compatible providers. + +## 11. Verification Criteria (definition of "done") + +The track is complete when ALL of the following are true: + +- [ ] All 3 phase 1-4 tests pass (`pytest tests/test_ai_loop_regressions_20260614.py` shows green). +- [ ] Full test suite passes (`uv run pytest tests/` shows green; no new failures). +- [ ] `grep -rn "ProviderError" src/` returns no matches. +- [ ] `grep -rn "ai_client\.ProviderError" src/` returns no matches. +- [ ] Live GUI test: a MiniMax `M2.7` request with reasoning returns a discussion entry that includes a `thinking_segments` field (use the `live_gui` fixture + `ApiHookClient.get_value("disc_entries")`). +- [ ] Live GUI test: a MiniMax request that fails (e.g., invalid API key) returns a discussion entry with `status="error"` and the error message in the `content` field. +- [ ] Live GUI test: a Gemini request that succeeds returns a discussion entry (verifies the FR1 fix doesn't break Gemini). +- [ ] `docs/guide_ai_client.md` "See Also" section includes the 2 follow-up notes (§13.1 Gemini thinking investigation, §13.2 `` half-width marker support). +- [ ] `metadata.json` `verification_criteria` field is updated to reflect completion. + +## 12. Acceptance Test (the user can verify this themselves) + +After this track ships, the user should be able to: + +1. Open Manual Slop with MiniMax as the active provider. +2. Send a message that requires the AI to reason (e.g., "explain the structure of this function"). +3. Verify: the AI's response appears in the Discussion Hub **without** manually pressing the `History` button. +4. Verify: the response has a `Monologue` collapsible section showing the AI's thinking. +5. Trigger a failure (e.g., switch to an invalid MiniMax API key, then send a message). +6. Verify: an error entry appears in the Discussion Hub with the error message. + +Before this track ships, steps 3 and 4 fail (for MiniMax); step 6 fails (for all 4 affected providers). + +## 13. See Also — Follow-up Notes + +### 13.1 Gemini / Gemini CLI thinking-format compatibility (deferred) + +The user's complaint includes Gemini and Gemini CLI. The likely cause is a format mismatch between what the Gemini SDK outputs and what `parse_thinking_trace` recognizes: + +- `parse_thinking_trace` (`src/thinking_parser.py:9`) matches ``, ``, and `Thinking:` prefix. +- The Gemini SDK's `resp.text` may include thinking as plain prose or as `*thinking aloud*` markdown, depending on the SDK version and the model's prompt formatting. + +This track fixes Bugs #1, #2, #3. The Gemini / Gemini CLI thinking-format issue is plausibly a pre-existing limitation (the existing tests for `parse_thinking_trace` show it doesn't match all Gemini output formats) rather than a new regression from the recent refactor. + +**Follow-up track** (to be specced): investigate empirically by running a Gemini request that produces reasoning and inspecting the raw `resp.text`; add a normalization pass in `_send_gemini*` if needed. + +### 13.2 `` (half-width) marker support (deferred) + +The user's screenshot 1 shows a discussion entry containing `This is DWARF debug info, not the actual disassembly...` — the half-width `` form (no closing `` in the regex). The current `parse_thinking_trace` regex (`src/thinking_parser.py:9`) requires the full `` form. Some models (notably certain DeepSeek-R1 outputs and possibly the MiniMax M2.7 output) use the half-width `` form. + +**Follow-up track** (to be specced): extend `parse_thinking_trace` to support the half-width `...` form (the closing tag is the same). The change is small (~3 lines in `src/thinking_parser.py:9`); the test file is `tests/test_thinking_trace.py` (5+ existing tests for the full-width form). + +### 13.3 Public API Result Migration (planned, separate) + +The `public_api_migration_20260606` follow-up (planned, not yet specced) will migrate the 5 remaining production call sites and 63 test call sites to `send_result()`. This track fixes the 3 sites in `app_controller.py` that are actively broken; the public_api track picks up from there.