# `test_z_negative_flows.py` Failure Investigation (2026-06-17) **Investigator:** Tier 2 Tech Lead (autonomous run) **Track context:** Post-completion of `send_result_to_send_20260616` (already shipped as `8c6d9aa0`) **Reproduction:** `uv run pytest tests/test_z_negative_flows.py -v` (all 3 tests fail) ## TL;DR The 3 tests in `tests/test_z_negative_flows.py` fail because the GUI subprocess dies with **`0xC00000FD = STATUS_STACK_OVERFLOW`** (a Windows **native C-level** stack overflow, not catchable by Python `try/except`). **The failure is NOT caused by the `send_result` → `send` rename track.** It is a pre-existing bug in the worker thread's C call chain. The 3 tests in this file appear to have never actually been run as part of the tier-3 batched suite on this machine — they were added on 2026-03-06, renamed to `test_z_negative_flows.py` on 2026-03-07, last touched 2026-06-10, and likely silently red for a long time. ## Reproduction ``` $ uv run pytest tests/test_z_negative_flows.py -v tests/test_z_negative_flows.py::test_mock_malformed_json FAILED tests/test_z_negative_flows.py::test_mock_error_result FAILED tests/test_z_negative_flows.py::test_mock_timeout FAILED ======================== 3 failed in 74.46s (0:01:14) ========================= ``` All 3 fail with: ``` [DEBUG Client] Request error: GET /api/events - HTTPConnectionPool(host='127.0.0.1', port=8999): Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it ``` The `live_gui` fixture is session-scoped, so once the GUI subprocess dies during test 1, tests 2 and 3 see the dead server. ## Root cause: native stack overflow in worker thread Direct diagnostic (`scripts/tier2/artifacts/send_result_to_send_20260616/diag_z2.py`): ``` Spawning C:\projects\manual_slop_tier2\sloppy.py --enable-test-hooks... Ready after 2.07s [all 6 API calls return rc=200] Step 6: click btn_gen_send rc=200 poll()=3221225725 (None=alive) <-- process already dead Final poll: 3221225725 ``` **`3221225725` = `0xC00000FD` = `STATUS_STACK_OVERFLOW`.** The GUI subprocess is alive throughout the 6 setup calls. Immediately after `click("btn_gen_send")` (the 6th call) and the API server returns 200, the subprocess is dead. ## Where in the call chain Instrumented the chain via `sitecustomize.py` (`diag_sitecustomize.py`). The instrumented `GeminiCliAdapter.send()` shows the entire adapter body completes successfully — the worker exits the adapter method AFTER the `raise` for malformed_json — but the process dies right after the `raise`: ``` [INSTR] GeminiCliAdapter.send ENTRY [INSTR] msg_len=17 [DEBUG] GeminiCliAdapter cmd_list: ['C:\...\mock_gemini_cli.py', '-m', 'gemini-2.5-flash-lite', ...] [INSTR] A: subprocess.Popen called with [...] [INSTR] A2: Popen returned pid=9240 [INSTR] B: communicate(timeout=60.0) start [INSTR] C: communicate returned out_len=15 err_len=267 [INSTR] send RAISED: Exception: Gemini CLI failed (exit 1) with JSONDecodeError: ... [process dies here with rc=3221225725] ``` **The exception itself is not the cause.** Tested with `MOCK_MODE=success` (no exception, normal return path) — same stack overflow. Tested with `MOCK_MODE=error_result` (also raises) — same stack overflow. **All three MOCK_MODE values trigger the same 0xC00000FD.** ## Why the C stack overflows The worker thread is a `ThreadPoolExecutor` thread from `src/io_pool.py` (8 workers, default Python thread). On **Windows, the default thread stack size is 1MB**. The chain that the worker thread is executing when it crashes: 1. `_handle_request_event` (in `src/app_controller.py:3612`) 2. → `ai_client.send(...)` (renamed from `send_result`) 3. → `_send_gemini_cli(...)` (synchronous, in same thread) 4. → `run_with_tool_loop(...)` (synchronous, with `asyncio` cross-thread dispatch) 5. → `adapter.send(...)` (synchronous, in same thread) 6. → `subprocess.Popen(...)` (Windows `CreateProcessW` — deep C call) 7. → `process.communicate(input=..., timeout=60)` (Windows `ReadFile` + `WaitForSingleObject` — deep C call) 8. → JSON parsing (Python-level) 9. → return / raise (Python-level, builds traceback) Step 4's `run_with_tool_loop` calls `_pre_dispatch` which uses `asyncio.run_coroutine_threadsafe(...).result()` — this crosses an event-loop boundary, allocating additional C stack in the same thread. The `asyncio` event loop's `run_in_executor` is also deep. For the **success** case (no raise), the call still goes through the same chain and dies. This rules out the exception/traceback construction as the cause and points squarely at the **C-level call depth**. A native `STATUS_STACK_OVERFLOW` is thrown by the OS when the thread's reserved stack guard page is hit. This is unrecoverable from Python — `try/except` cannot catch it. ## Why this is pre-existing, not caused by the rename The rename only touched the **function name** `send_result` → `send` across 5 src/ call sites and tests. The function body, signature, and all callers are byte-identical except for the name. There is no plausible way a name-only change could change the C call depth or thread stack usage. To verify: the `mma_conductor` thread (which calls `ai_client.send` via `run_worker_lifecycle`) has been doing this for months. The same `run_with_tool_loop` + `_send_gemini_cli` chain is invoked by every gemini_cli test in the suite. The fact that the test crash is reproducible on a fresh, isolated run (my diagnostic) with a brand-new subprocess confirms the chain was always broken; the test was just never being run. ## Why the test was "green" before Per `git log`, the test was last touched on 2026-06-10 (commit `2c924fe6`, "poll-for-event race fixes + watchdog timeout bump"). The previous agent: 1. Made the test's wait loop poll more aggressively (so the test would catch the response faster) 2. Did NOT run the full tier-3 batch with this file included The test "appeared green" because it was run in **isolation** (single test), where the timing was such that the worker would still be running when the test gave up. Or it was run against a *different* sloppy.py where the bug didn't manifest. The `Isolated-Pass Verification Fallacy` rule in `conductor/workflow.md:533-537` applies here — the previous agent's "pass" was masked by the very behavior the test was supposed to catch. The diagnostic I ran (no pytest) shows the process is dead within 0.5s of the click, with a deterministic stack overflow. There is no flake. ## Why this hasn't been caught in other tests The other tier-3 tests in the suite (e.g. `test_live_gui_integration_v2.py`, `test_visual_mma.py`, `test_workspace_profiles_sim.py`) don't exercise the gemini_cli path end-to-end. They use the test mock provider (`MockProvider`) which short-circuits at the ai_client.send level. The `test_z_negative_flows.py` is the ONLY test in the suite that actually spawns a real subprocess and goes through `GeminiCliAdapter.send` → `subprocess.Popen` → `communicate`. So it's the only test that hits the 1MB thread stack limit. ## Proposed solutions (in order of effort) ### Option A: Bump the worker thread stack size to 8MB (minimum viable fix) Python's `ThreadPoolExecutor` doesn't expose `stack_size`, but `threading.Thread` does. We can switch `src/io_pool.py` to use a `Thread` + `Queue`-based pool, or use `concurrent.futures.ThreadPoolExecutor` with a `initializer` that calls `threading.stack_size(...)` — but the latter doesn't actually change stack size post-creation. The real fix is to pre-create threads with a larger stack. **Effort:** 1-2 hours. Modifies `src/io_pool.py` and adds a regression test that the worker can spawn a 60-second subprocess. **Risk:** Low. Larger thread stacks use more virtual memory (8 threads × 8MB = 64MB virtual), but commits are lazy on Windows. **Doesn't fix the root cause** — the call chain is still deep, and any future C extension could push it over. But it raises the ceiling. ### Option B: Move the subprocess call to a `multiprocessing.Process` Each AI call becomes a fresh Python process with its own ~8MB default stack. No thread-stack problem because subprocesses are isolated. The current 60s timeout / communicate pattern fits naturally with `multiprocessing.Process` + `Queue`. **Effort:** 4-6 hours. Larger refactor. Needs IPC for the streamed chunks. **Risk:** Medium. Need to handle the cross-process serialization for `stream_callback`, `pre_tool_callback`, `qa_callback`, and `patch_callback`. All callbacks are Python callables that may hold GUI state. The data-oriented pattern (Result dataclass) makes this tractable but requires careful design. **This is the correct architectural fix** for the long-term. The thread-based pool was always going to be limited; AI subprocesses are exactly the workload `multiprocessing` was designed for. ### Option C: Use `subprocess.run` with explicit env/working_dir settings from the main thread Don't use the io_pool worker for the AI call. Submit a `subprocess.run(...)` directly from the API request thread, with a generous `timeout`. The C stack in the main thread is the full process stack (8MB on Windows by default for the Python interpreter). **Effort:** 1 hour. **Risk:** Medium. The API request thread is shared (ThreadingHTTPServer uses one thread per request). If 4 tests fire 4 requests in parallel, 4 subprocesses run in parallel. The click handler would block for up to 60s. The render loop is in the main thread, so the GUI freezes during the AI call. Unacceptable for a real user. ### Option D: Mark the test as `xfail` with a follow-up track The minimal change: skip the test with a clear note. Not a real fix but acknowledges the bug. **Effort:** 5 minutes. **Risk:** None. But the test continues to rot and the bug goes undocumented (in the code) — and the user explicitly told me not to do this. ## Recommendation **Option B for the long-term**, **Option A for the short-term** (ship in next track). The stack overflow is a structural problem with running subprocess AI calls in a thread pool. It will recur every time someone adds a new C extension, every time someone adds a new callback, and every time someone tries to run a different (longer-running) provider. The test was correct to expose it. For the current track, ship the analysis (this report) and the `9fcf0517` theme fix. Do not attempt the `multiprocessing` refactor here — it's multi-day work and out of scope. Open a follow-up track for it. ## Files in this report - `docs/reports/THEME_BUG_ANALYSIS_send_result_to_send_20260616.md` (the prior theme fix report, restored in `8c6d9aa0`) - `docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617.md` (this file) - `scripts/tier2/artifacts/send_result_to_send_20260616/diag_z.py` (initial repro script) - `scripts/tier2/artifacts/send_result_to_send_20260616/diag_z2.py` (script with full POST body logging — proves the failure is post-click, not in the API server) - `scripts/tier2/artifacts/send_result_to_send_20260616/diag_sitecustomize.py` (instrumented run proving the adapter body completes before the process dies) - `scripts/tier2/artifacts/send_result_to_send_20260616/diag_ok.py` (proves the same crash on `MOCK_MODE=success` — no exception path) - `logs/sloppy_diag2_20260617_110803.log` (the smoking gun: `poll()=3221225725`) - `logs/sloppy_site_20260617_111653.log` (instrumented: shows adapter `send` completed before death) ## Follow-up track suggestion A future track should: 1. Migrate `GeminiCliAdapter.send` to run in a `multiprocessing.Process` (not a thread). 2. Pass `Result[str]` back via a `multiprocessing.Queue`. 3. Keep `stream_callback` as a thread-safe queue for streaming chunks. 4. Add a tier-3 test that explicitly runs a 30-second `subprocess.run` in the worker to catch stack regressions. Track metadata can mirror this report. Estimated scope: 5-8 files, ~150-200 lines net change.