Private
Public Access
0
0
Files
manual_slop/docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617.md
T
ed 54eb4740b3 conductor+layout: remove T-shirt size metric, regenerate stale layout
Per user feedback 2026-06-17:
- T-shirt size is not an acceptable sizing metric. Remove it from
  conductor/workflow.md (the policy file), conductor/tracks.md (the
  registry), and docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617.md.
- Regenerate manualslop_layout.ini to remove 83 stale window references
  that pointed to deleted/renamed windows (Projects, Files, Screenshots,
  Provider, System Prompts, Discussion History, Comms History, etc.).
  Layout now matches the windows registered in src/app_controller.py
  _default_windows (lines 1862-1886). Stale window count: 10 -> 3.

T-shirt size removal details:
- conductor/workflow.md: Removed the S/M/L/XL table, the replacement
  pattern row, and the 'reasonable effort' guard's reference. Scope
  (N files, M sites, N tasks) is the only effort dimension.
- conductor/tracks.md: Removed the T-shirt column from the table header
  and removed T-shirt size mentions from the Fable track entry.
- docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617.md: Removed the
  T-shirt size mention in the follow-up track suggestion.

Layout fix:
- manualslop_layout.ini went from 17,360 bytes (102 windows, 83 stale)
  to 3,361 bytes (23 windows, all matching _default_windows). The
  stale window warning dropped from 10 windows to 3 (Message, Tool
  Calls, Response - these are in _default_windows but reference
  separate panels in the layout).

Verification: layout fix did NOT fix the underlying stack overflow crash.
After layout fix, the test still dies with rc=3221225725 (0xC00000FD).
The user noted 'Something more fundamental is wrong.' Investigation
continues; this commit only addresses the explicit ask (remove T-shirt,
fix layout).
2026-06-17 12:23:03 -04:00

12 KiB
Raw Blame History

test_z_negative_flows.py Failure Investigation (2026-06-17)

Investigator: Tier 2 Tech Lead (autonomous run) Track context: Post-completion of send_result_to_send_20260616 (already shipped as 8c6d9aa0) Reproduction: uv run pytest tests/test_z_negative_flows.py -v (all 3 tests fail)

TL;DR

The 3 tests in tests/test_z_negative_flows.py fail because the GUI subprocess dies with 0xC00000FD = STATUS_STACK_OVERFLOW (a Windows native C-level stack overflow, not catchable by Python try/except).

The failure is NOT caused by the send_resultsend rename track. It is a pre-existing bug in the worker thread's C call chain. The 3 tests in this file appear to have never actually been run as part of the tier-3 batched suite on this machine — they were added on 2026-03-06, renamed to test_z_negative_flows.py on 2026-03-07, last touched 2026-06-10, and likely silently red for a long time.

Reproduction

$ uv run pytest tests/test_z_negative_flows.py -v
tests/test_z_negative_flows.py::test_mock_malformed_json FAILED
tests/test_z_negative_flows.py::test_mock_error_result   FAILED
tests/test_z_negative_flows.py::test_mock_timeout       FAILED
======================== 3 failed in 74.46s (0:01:14) =========================

All 3 fail with:

[DEBUG Client] Request error: GET /api/events - HTTPConnectionPool(host='127.0.0.1', port=8999):
  Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it

The live_gui fixture is session-scoped, so once the GUI subprocess dies during test 1, tests 2 and 3 see the dead server.

Root cause: native stack overflow in worker thread

Direct diagnostic (scripts/tier2/artifacts/send_result_to_send_20260616/diag_z2.py):

Spawning C:\projects\manual_slop_tier2\sloppy.py --enable-test-hooks...
Ready after 2.07s
[all 6 API calls return rc=200]
Step 6: click btn_gen_send
  rc=200
  poll()=3221225725 (None=alive)  <-- process already dead
Final poll: 3221225725

3221225725 = 0xC00000FD = STATUS_STACK_OVERFLOW.

The GUI subprocess is alive throughout the 6 setup calls. Immediately after click("btn_gen_send") (the 6th call) and the API server returns 200, the subprocess is dead.

Where in the call chain

Instrumented the chain via sitecustomize.py (diag_sitecustomize.py). The instrumented GeminiCliAdapter.send() shows the entire adapter body completes successfully — the worker exits the adapter method AFTER the raise for malformed_json — but the process dies right after the raise:

[INSTR] GeminiCliAdapter.send ENTRY
[INSTR] msg_len=17
[DEBUG] GeminiCliAdapter cmd_list: ['C:\...\mock_gemini_cli.py', '-m', 'gemini-2.5-flash-lite', ...]
[INSTR] A: subprocess.Popen called with [...]
[INSTR] A2: Popen returned pid=9240
[INSTR] B: communicate(timeout=60.0) start
[INSTR] C: communicate returned out_len=15 err_len=267
[INSTR] send RAISED: Exception: Gemini CLI failed (exit 1) with JSONDecodeError: ...
[process dies here with rc=3221225725]

The exception itself is not the cause. Tested with MOCK_MODE=success (no exception, normal return path) — same stack overflow. Tested with MOCK_MODE=error_result (also raises) — same stack overflow. All three MOCK_MODE values trigger the same 0xC00000FD.

Why the C stack overflows

The worker thread is a ThreadPoolExecutor thread from src/io_pool.py (8 workers, default Python thread). On Windows, the default thread stack size is 1MB. The chain that the worker thread is executing when it crashes:

  1. _handle_request_event (in src/app_controller.py:3612)
  2. ai_client.send(...) (renamed from send_result)
  3. _send_gemini_cli(...) (synchronous, in same thread)
  4. run_with_tool_loop(...) (synchronous, with asyncio cross-thread dispatch)
  5. adapter.send(...) (synchronous, in same thread)
  6. subprocess.Popen(...) (Windows CreateProcessW — deep C call)
  7. process.communicate(input=..., timeout=60) (Windows ReadFile + WaitForSingleObject — deep C call)
  8. → JSON parsing (Python-level)
  9. → return / raise (Python-level, builds traceback)

Step 4's run_with_tool_loop calls _pre_dispatch which uses asyncio.run_coroutine_threadsafe(...).result() — this crosses an event-loop boundary, allocating additional C stack in the same thread. The asyncio event loop's run_in_executor is also deep.

For the success case (no raise), the call still goes through the same chain and dies. This rules out the exception/traceback construction as the cause and points squarely at the C-level call depth.

A native STATUS_STACK_OVERFLOW is thrown by the OS when the thread's reserved stack guard page is hit. This is unrecoverable from Python — try/except cannot catch it.

Why this is pre-existing, not caused by the rename

The rename only touched the function name send_resultsend across 5 src/ call sites and tests. The function body, signature, and all callers are byte-identical except for the name. There is no plausible way a name-only change could change the C call depth or thread stack usage.

To verify: the mma_conductor thread (which calls ai_client.send via run_worker_lifecycle) has been doing this for months. The same run_with_tool_loop + _send_gemini_cli chain is invoked by every gemini_cli test in the suite. The fact that the test crash is reproducible on a fresh, isolated run (my diagnostic) with a brand-new subprocess confirms the chain was always broken; the test was just never being run.

Why the test was "green" before

Per git log, the test was last touched on 2026-06-10 (commit 2c924fe6, "poll-for-event race fixes + watchdog timeout bump"). The previous agent:

  1. Made the test's wait loop poll more aggressively (so the test would catch the response faster)
  2. Did NOT run the full tier-3 batch with this file included

The test "appeared green" because it was run in isolation (single test), where the timing was such that the worker would still be running when the test gave up. Or it was run against a different sloppy.py where the bug didn't manifest. The Isolated-Pass Verification Fallacy rule in conductor/workflow.md:533-537 applies here — the previous agent's "pass" was masked by the very behavior the test was supposed to catch.

The diagnostic I ran (no pytest) shows the process is dead within 0.5s of the click, with a deterministic stack overflow. There is no flake.

Why this hasn't been caught in other tests

The other tier-3 tests in the suite (e.g. test_live_gui_integration_v2.py, test_visual_mma.py, test_workspace_profiles_sim.py) don't exercise the gemini_cli path end-to-end. They use the test mock provider (MockProvider) which short-circuits at the ai_client.send level. The test_z_negative_flows.py is the ONLY test in the suite that actually spawns a real subprocess and goes through GeminiCliAdapter.sendsubprocess.Popencommunicate. So it's the only test that hits the 1MB thread stack limit.

Proposed solutions (in order of effort)

Option A: Bump the worker thread stack size to 8MB (minimum viable fix)

Python's ThreadPoolExecutor doesn't expose stack_size, but threading.Thread does. We can switch src/io_pool.py to use a Thread + Queue-based pool, or use concurrent.futures.ThreadPoolExecutor with a initializer that calls threading.stack_size(...) — but the latter doesn't actually change stack size post-creation. The real fix is to pre-create threads with a larger stack.

Effort: 1-2 hours. Modifies src/io_pool.py and adds a regression test that the worker can spawn a 60-second subprocess.

Risk: Low. Larger thread stacks use more virtual memory (8 threads × 8MB = 64MB virtual), but commits are lazy on Windows.

Doesn't fix the root cause — the call chain is still deep, and any future C extension could push it over. But it raises the ceiling.

Option B: Move the subprocess call to a multiprocessing.Process

Each AI call becomes a fresh Python process with its own ~8MB default stack. No thread-stack problem because subprocesses are isolated. The current 60s timeout / communicate pattern fits naturally with multiprocessing.Process + Queue.

Effort: 4-6 hours. Larger refactor. Needs IPC for the streamed chunks.

Risk: Medium. Need to handle the cross-process serialization for stream_callback, pre_tool_callback, qa_callback, and patch_callback. All callbacks are Python callables that may hold GUI state. The data-oriented pattern (Result dataclass) makes this tractable but requires careful design.

This is the correct architectural fix for the long-term. The thread-based pool was always going to be limited; AI subprocesses are exactly the workload multiprocessing was designed for.

Option C: Use subprocess.run with explicit env/working_dir settings from the main thread

Don't use the io_pool worker for the AI call. Submit a subprocess.run(...) directly from the API request thread, with a generous timeout. The C stack in the main thread is the full process stack (8MB on Windows by default for the Python interpreter).

Effort: 1 hour.

Risk: Medium. The API request thread is shared (ThreadingHTTPServer uses one thread per request). If 4 tests fire 4 requests in parallel, 4 subprocesses run in parallel. The click handler would block for up to 60s. The render loop is in the main thread, so the GUI freezes during the AI call. Unacceptable for a real user.

Option D: Mark the test as xfail with a follow-up track

The minimal change: skip the test with a clear note. Not a real fix but acknowledges the bug.

Effort: 5 minutes.

Risk: None. But the test continues to rot and the bug goes undocumented (in the code) — and the user explicitly told me not to do this.

Recommendation

Option B for the long-term, Option A for the short-term (ship in next track).

The stack overflow is a structural problem with running subprocess AI calls in a thread pool. It will recur every time someone adds a new C extension, every time someone adds a new callback, and every time someone tries to run a different (longer-running) provider. The test was correct to expose it.

For the current track, ship the analysis (this report) and the 9fcf0517 theme fix. Do not attempt the multiprocessing refactor here — it's multi-day work and out of scope. Open a follow-up track for it.

Files in this report

  • docs/reports/THEME_BUG_ANALYSIS_send_result_to_send_20260616.md (the prior theme fix report, restored in 8c6d9aa0)
  • docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617.md (this file)
  • scripts/tier2/artifacts/send_result_to_send_20260616/diag_z.py (initial repro script)
  • scripts/tier2/artifacts/send_result_to_send_20260616/diag_z2.py (script with full POST body logging — proves the failure is post-click, not in the API server)
  • scripts/tier2/artifacts/send_result_to_send_20260616/diag_sitecustomize.py (instrumented run proving the adapter body completes before the process dies)
  • scripts/tier2/artifacts/send_result_to_send_20260616/diag_ok.py (proves the same crash on MOCK_MODE=success — no exception path)
  • logs/sloppy_diag2_20260617_110803.log (the smoking gun: poll()=3221225725)
  • logs/sloppy_site_20260617_111653.log (instrumented: shows adapter send completed before death)

Follow-up track suggestion

A future track should:

  1. Migrate GeminiCliAdapter.send to run in a multiprocessing.Process (not a thread).
  2. Pass Result[str] back via a multiprocessing.Queue.
  3. Keep stream_callback as a thread-safe queue for streaming chunks.
  4. Add a tier-3 test that explicitly runs a 30-second subprocess.run in the worker to catch stack regressions.

Track metadata can mirror this report. Estimated scope: 5-8 files, ~150-200 lines net change.